[jira] [Commented] (MESOS-3774) Migrate Future tests from process_tests.cpp to future_tests.cpp
[ https://issues.apache.org/jira/browse/MESOS-3774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15211404#comment-15211404 ] Anand Mazumdar commented on MESOS-3774: --- {code} commit 43a684e349bb9267a2562dd2716248397daf7197 Author: Cong Wang xiyou.wangc...@gmail.com Date: Thu Mar 10 13:17:32 2016 -0500 Moved future tests into `future_tests.cpp`. Review: https://reviews.apache.org/r/44026/ {code} > Migrate Future tests from process_tests.cpp to future_tests.cpp > --- > > Key: MESOS-3774 > URL: https://issues.apache.org/jira/browse/MESOS-3774 > Project: Mesos > Issue Type: Improvement > Reporter: Gilbert Song > Priority: Minor > Labels: mesosphere, newbie, testing > Fix For: 0.29.0 > > > Currently we do not have many `Future` tests in > /mesos/3rdparty/libprocess/src/tests/future_tests.cpp > It would be clearer to move all future-related tests > from: /mesos/3rdparty/libprocess/src/tests/process_tests.cpp > to: /mesos/3rdparty/libprocess/src/tests/future_tests.cpp -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5023) MesosContainerizerProvisionerTest.DestroyWhileProvisioning is flaky.
[ https://issues.apache.org/jira/browse/MESOS-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Klaus Ma updated MESOS-5023: Summary: MesosContainerizerProvisionerTest.DestroyWhileProvisioning is flaky. (was: MesosContainerizerProvisionerTest.ProvisionFailed is flaky.) > MesosContainerizerProvisionerTest.DestroyWhileProvisioning is flaky. > > > Key: MESOS-5023 > URL: https://issues.apache.org/jira/browse/MESOS-5023 > Project: Mesos > Issue Type: Bug >Reporter: Alexander Rukletsov >Assignee: Klaus Ma > Labels: mesosphere > > Observed on the Apache Jenkins. > {noformat} > [ RUN ] MesosContainerizerProvisionerTest.ProvisionFailed > I0324 13:38:56.284261 2948 containerizer.cpp:666] Starting container > 'test_container' for executor 'executor' of framework '' > I0324 13:38:56.285825 2939 containerizer.cpp:1421] Destroying container > 'test_container' > I0324 13:38:56.285854 2939 containerizer.cpp:1424] Waiting for the > provisioner to complete for container 'test_container' > [ OK ] MesosContainerizerProvisionerTest.ProvisionFailed (7 ms) > [ RUN ] MesosContainerizerProvisionerTest.DestroyWhileProvisioning > I0324 13:38:56.291187 2944 containerizer.cpp:666] Starting container > 'c2316963-c6cb-4c7f-a3b9-17ca5931e5b2' for executor 'executor' of framework '' > I0324 13:38:56.292157 2944 containerizer.cpp:1421] Destroying container > 'c2316963-c6cb-4c7f-a3b9-17ca5931e5b2' > I0324 13:38:56.292179 2944 containerizer.cpp:1424] Waiting for the > provisioner to complete for container 'c2316963-c6cb-4c7f-a3b9-17ca5931e5b2' > F0324 13:38:56.292899 2944 containerizer.cpp:752] Check failed: > containers_.contains(containerId) > *** Check failure stack trace: *** > @ 0x2ac9973d0ae4 google::LogMessage::Fail() > @ 0x2ac9973d0a30 google::LogMessage::SendToLog() > @ 0x2ac9973d0432 google::LogMessage::Flush() > @ 0x2ac9973d3346 google::LogMessageFatal::~LogMessageFatal() > @ 0x2ac996af897c > mesos::internal::slave::MesosContainerizerProcess::_launch() > @ 
0x2ac996b1f18a > _ZZN7process8dispatchIbN5mesos8internal5slave25MesosContainerizerProcessERKNS1_11ContainerIDERK6OptionINS1_8TaskInfoEERKNS1_12ExecutorInfoERKSsRKS8_ISsERKNS1_7SlaveIDERKNS_3PIDINS3_5SlaveEEEbRKS8_INS3_13ProvisionInfoEES5_SA_SD_SsSI_SL_SQ_bSU_EENS_6FutureIT_EERKNSO_IT0_EEMS10_FSZ_T1_T2_T3_T4_T5_T6_T7_T8_T9_ET10_T11_T12_T13_T14_T15_T16_T17_T18_ENKUlPNS_11ProcessBaseEE_clES1P_ > @ 0x2ac996b479d9 > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIbN5mesos8internal5slave25MesosContainerizerProcessERKNS5_11ContainerIDERK6OptionINS5_8TaskInfoEERKNS5_12ExecutorInfoERKSsRKSC_ISsERKNS5_7SlaveIDERKNS0_3PIDINS7_5SlaveEEEbRKSC_INS7_13ProvisionInfoEES9_SE_SH_SsSM_SP_SU_bSY_EENS0_6FutureIT_EERKNSS_IT0_EEMS14_FS13_T1_T2_T3_T4_T5_T6_T7_T8_T9_ET10_T11_T12_T13_T14_T15_T16_T17_T18_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x2ac997334fef std::function<>::operator()() > @ 0x2ac99731b1c7 process::ProcessBase::visit() > @ 0x2ac997321154 process::DispatchEvent::visit() > @ 0x9a699c process::ProcessBase::serve() > @ 0x2ac9973173c0 process::ProcessManager::resume() > @ 0x2ac99731445a > _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ > @ 0x2ac997320916 > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x2ac9973208c6 > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ > @ 0x2ac997320858 > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE > @ 0x2ac9973207af > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv > @ 0x2ac997320748 > 
_ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv > @ 0x2ac9989aea60 (unknown) > @ 0x2ac999125182 start_thread > @ 0x2ac99943547d (unknown) > make[4]: Leaving directory `/mesos/mesos-0.29.0/_build/src' > make[4]: *** [check-local] Aborted > make[3]: *** [check-am] Error 2 > make[3]: Leaving directory `/mesos/mesos-0.29.0/_build/src' > make[2]: *** [check] Error 2 > make[2]: Leaving directory `/mesos/mesos-0.29.0/_build/src' > make[1]: *** [check-recursive] Error 1 > make[1]: Leaving directory `/mesos/mesos-0.29.0/_build' > make: *** [distcheck] Error 1 > Build step 'Execute
[jira] [Commented] (MESOS-3902) The Location header when non-leading master redirects to leading master is incomplete.
[ https://issues.apache.org/jira/browse/MESOS-3902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15211256#comment-15211256 ] Ashwin Murthy commented on MESOS-3902: -- Sent an updated diff with the final changes after validating this end-to-end with 3 masters and ZooKeeper; it works as expected. > The Location header when non-leading master redirects to leading master is > incomplete. > -- > > Key: MESOS-3902 > URL: https://issues.apache.org/jira/browse/MESOS-3902 > Project: Mesos > Issue Type: Bug > Components: HTTP API, master > Affects Versions: 0.25.0 > Environment: 3 masters, 10 slaves > Reporter: Ben Whitehead > Assignee: Ashwin Murthy > Labels: mesosphere > > The master now sets a Location header, but it's incomplete: the path of the > URL isn't set. Consider an example: > {code} > > cat /tmp/subscribe-1072944352375841456 | http POST > > 127.1.0.3:5050/api/v1/scheduler Content-Type:application/x-protobuf > POST /api/v1/scheduler HTTP/1.1 > Accept: application/json > Accept-Encoding: gzip, deflate > Connection: keep-alive > Content-Length: 123 > Content-Type: application/x-protobuf > Host: 127.1.0.3:5050 > User-Agent: HTTPie/0.9.0 > +-+ > | NOTE: binary data not shown in terminal | > +-+ > HTTP/1.1 307 Temporary Redirect > Content-Length: 0 > Date: Fri, 26 Feb 2016 00:54:41 GMT > Location: //127.1.0.1:5050 > {code}
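The intended behavior can be sketched as follows. This is an illustrative Python sketch, not the actual Mesos C++ code; the function name and signature are hypothetical. The point is that the redirect Location should carry the original request path, so `//127.1.0.1:5050` becomes `//127.1.0.1:5050/api/v1/scheduler`.

```python
# Hypothetical sketch: build a complete redirect Location from the
# leading master's host:port and the original request path. A
# scheme-relative reference ("//host:port/path") preserves whichever
# scheme (http/https) the client used.
def redirect_location(leader_hostport, request_path):
    return "//{}{}".format(leader_hostport, request_path)

print(redirect_location("127.1.0.1:5050", "/api/v1/scheduler"))
```

With the path preserved, an HTTP client can follow the 307 redirect and replay the POST against the leading master directly.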
[jira] [Commented] (MESOS-5028) Copy provisioner does not work for docker image layers with dangling symlink
[ https://issues.apache.org/jira/browse/MESOS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15211203#comment-15211203 ] Zhitao Li commented on MESOS-5028: -- [~gilbert] and I took a quick look, and this is actually caused by the layer trying to replace a directory with a symlink, which is not allowed by `cp -aT` (sorry, my previous description was a bit misleading). > Copy provisioner does not work for docker image layers with dangling symlink > > > Key: MESOS-5028 > URL: https://issues.apache.org/jira/browse/MESOS-5028 > Project: Mesos > Issue Type: Bug > Components: containerization > Reporter: Zhitao Li > Assignee: Gilbert Song > > I'm trying to play with the new image provisioner on our custom docker > images, but one of the layers failed to get copied, possibly due to a dangling > symlink. > Error log with GLOG_v=1: > {quote} > I0324 05:42:48.926678 15067 copy.cpp:127] Copying layer path > '/tmp/mesos/store/docker/layers/5df0888641196b88dcc1b97d04c74839f02a73b8a194a79e134426d6a8fcb0f1/rootfs' > to rootfs > '/var/lib/mesos/provisioner/containers/5f05be6c-c970-4539-aa64-fd0eef2ec7ae/backends/copy/rootfses/507173f3-e316-48a3-a96e-5fdea9ffe9f6' > E0324 05:42:49.028506 15062 slave.cpp:3773] Container > '5f05be6c-c970-4539-aa64-fd0eef2ec7ae' for executor 'test' of framework > 75932a89-1514-4011-bafe-beb6a208bb2d-0004 failed to start: Collect failed: > Collect failed: Failed to copy layer: cp: cannot overwrite directory > ‘/var/lib/mesos/provisioner/containers/5f05be6c-c970-4539-aa64-fd0eef2ec7ae/backends/copy/rootfses/507173f3-e316-48a3-a96e-5fdea9ffe9f6/etc/apt’ > with non-directory > {quote} > Content of > _/tmp/mesos/store/docker/layers/5df0888641196b88dcc1b97d04c74839f02a73b8a194a79e134426d6a8fcb0f1/rootfs/etc/apt_ > points to a non-existing absolute path (cannot provide exact path but it's a > result of us trying to mount apt keys into docker container at build time). 
> I believe what happened is that we executed a script at build time, which > contains the equivalent of: > {quote} > rm -rf /etc/apt/* && ln -sf /build-mount-point/ /etc/apt > {quote}
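The failure mode is easy to reproduce outside Mesos. The sketch below uses assumed paths (the real layer paths are elided in the ticket) and requires GNU `cp` on Linux: `cp -aT` refuses to overwrite an existing directory in the destination with a non-directory, such as a symlink, which is exactly what happens when a layer replaces `/etc/apt` with a link.

```python
import os
import subprocess
import tempfile

work = tempfile.mkdtemp()

# The destination rootfs already has etc/apt as a real directory,
# copied from an earlier layer.
os.makedirs(os.path.join(work, "rootfs", "etc", "apt"))

# The new layer replaced etc/apt with a (dangling) symlink at build time.
os.makedirs(os.path.join(work, "layer", "etc"))
os.symlink("/build-mount-point", os.path.join(work, "layer", "etc", "apt"))

# This mirrors what the copy backend runs; GNU cp exits non-zero with
# "cannot overwrite directory ... with non-directory".
result = subprocess.run(
    ["cp", "-aT", os.path.join(work, "layer"), os.path.join(work, "rootfs")],
    capture_output=True, text=True)

print(result.returncode, result.stderr.strip())
```

Note that the symlink's target does not need to exist for the failure to trigger; the directory-vs-non-directory conflict alone is enough.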
[jira] [Updated] (MESOS-5028) Copy provisioner cannot replace directory with symlink
[ https://issues.apache.org/jira/browse/MESOS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-5028: - Summary: Copy provisioner cannot replace directory with symlink (was: Copy provisioner does not work for docker image layers with dangling symlink) > Copy provisioner cannot replace directory with symlink > -- > > Key: MESOS-5028 > URL: https://issues.apache.org/jira/browse/MESOS-5028 > Project: Mesos > Issue Type: Bug > Components: containerization > Reporter: Zhitao Li > Assignee: Gilbert Song > > I'm trying to play with the new image provisioner on our custom docker > images, but one of the layers failed to get copied, possibly due to a dangling > symlink. > Error log with GLOG_v=1: > {quote} > I0324 05:42:48.926678 15067 copy.cpp:127] Copying layer path > '/tmp/mesos/store/docker/layers/5df0888641196b88dcc1b97d04c74839f02a73b8a194a79e134426d6a8fcb0f1/rootfs' > to rootfs > '/var/lib/mesos/provisioner/containers/5f05be6c-c970-4539-aa64-fd0eef2ec7ae/backends/copy/rootfses/507173f3-e316-48a3-a96e-5fdea9ffe9f6' > E0324 05:42:49.028506 15062 slave.cpp:3773] Container > '5f05be6c-c970-4539-aa64-fd0eef2ec7ae' for executor 'test' of framework > 75932a89-1514-4011-bafe-beb6a208bb2d-0004 failed to start: Collect failed: > Collect failed: Failed to copy layer: cp: cannot overwrite directory > ‘/var/lib/mesos/provisioner/containers/5f05be6c-c970-4539-aa64-fd0eef2ec7ae/backends/copy/rootfses/507173f3-e316-48a3-a96e-5fdea9ffe9f6/etc/apt’ > with non-directory > {quote} > Content of > _/tmp/mesos/store/docker/layers/5df0888641196b88dcc1b97d04c74839f02a73b8a194a79e134426d6a8fcb0f1/rootfs/etc/apt_ > points to a non-existing absolute path (cannot provide exact path but it's a > result of us trying to mount apt keys into docker container at build time). 
> I believe what happened is that we executed a script at build time, which > contains the equivalent of: > {quote} > rm -rf /etc/apt/* && ln -sf /build-mount-point/ /etc/apt > {quote}
[jira] [Assigned] (MESOS-5023) MesosContainerizerProvisionerTest.ProvisionFailed is flaky.
[ https://issues.apache.org/jira/browse/MESOS-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Klaus Ma reassigned MESOS-5023: --- Assignee: Klaus Ma > MesosContainerizerProvisionerTest.ProvisionFailed is flaky. > --- > > Key: MESOS-5023 > URL: https://issues.apache.org/jira/browse/MESOS-5023 > Project: Mesos > Issue Type: Bug >Reporter: Alexander Rukletsov >Assignee: Klaus Ma > Labels: mesosphere > > Observed on the Apache Jenkins. > {noformat} > [ RUN ] MesosContainerizerProvisionerTest.ProvisionFailed > I0324 13:38:56.284261 2948 containerizer.cpp:666] Starting container > 'test_container' for executor 'executor' of framework '' > I0324 13:38:56.285825 2939 containerizer.cpp:1421] Destroying container > 'test_container' > I0324 13:38:56.285854 2939 containerizer.cpp:1424] Waiting for the > provisioner to complete for container 'test_container' > [ OK ] MesosContainerizerProvisionerTest.ProvisionFailed (7 ms) > [ RUN ] MesosContainerizerProvisionerTest.DestroyWhileProvisioning > I0324 13:38:56.291187 2944 containerizer.cpp:666] Starting container > 'c2316963-c6cb-4c7f-a3b9-17ca5931e5b2' for executor 'executor' of framework '' > I0324 13:38:56.292157 2944 containerizer.cpp:1421] Destroying container > 'c2316963-c6cb-4c7f-a3b9-17ca5931e5b2' > I0324 13:38:56.292179 2944 containerizer.cpp:1424] Waiting for the > provisioner to complete for container 'c2316963-c6cb-4c7f-a3b9-17ca5931e5b2' > F0324 13:38:56.292899 2944 containerizer.cpp:752] Check failed: > containers_.contains(containerId) > *** Check failure stack trace: *** > @ 0x2ac9973d0ae4 google::LogMessage::Fail() > @ 0x2ac9973d0a30 google::LogMessage::SendToLog() > @ 0x2ac9973d0432 google::LogMessage::Flush() > @ 0x2ac9973d3346 google::LogMessageFatal::~LogMessageFatal() > @ 0x2ac996af897c > mesos::internal::slave::MesosContainerizerProcess::_launch() > @ 0x2ac996b1f18a > 
_ZZN7process8dispatchIbN5mesos8internal5slave25MesosContainerizerProcessERKNS1_11ContainerIDERK6OptionINS1_8TaskInfoEERKNS1_12ExecutorInfoERKSsRKS8_ISsERKNS1_7SlaveIDERKNS_3PIDINS3_5SlaveEEEbRKS8_INS3_13ProvisionInfoEES5_SA_SD_SsSI_SL_SQ_bSU_EENS_6FutureIT_EERKNSO_IT0_EEMS10_FSZ_T1_T2_T3_T4_T5_T6_T7_T8_T9_ET10_T11_T12_T13_T14_T15_T16_T17_T18_ENKUlPNS_11ProcessBaseEE_clES1P_ > @ 0x2ac996b479d9 > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIbN5mesos8internal5slave25MesosContainerizerProcessERKNS5_11ContainerIDERK6OptionINS5_8TaskInfoEERKNS5_12ExecutorInfoERKSsRKSC_ISsERKNS5_7SlaveIDERKNS0_3PIDINS7_5SlaveEEEbRKSC_INS7_13ProvisionInfoEES9_SE_SH_SsSM_SP_SU_bSY_EENS0_6FutureIT_EERKNSS_IT0_EEMS14_FS13_T1_T2_T3_T4_T5_T6_T7_T8_T9_ET10_T11_T12_T13_T14_T15_T16_T17_T18_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x2ac997334fef std::function<>::operator()() > @ 0x2ac99731b1c7 process::ProcessBase::visit() > @ 0x2ac997321154 process::DispatchEvent::visit() > @ 0x9a699c process::ProcessBase::serve() > @ 0x2ac9973173c0 process::ProcessManager::resume() > @ 0x2ac99731445a > _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ > @ 0x2ac997320916 > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x2ac9973208c6 > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ > @ 0x2ac997320858 > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE > @ 0x2ac9973207af > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv > @ 0x2ac997320748 > 
_ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv > @ 0x2ac9989aea60 (unknown) > @ 0x2ac999125182 start_thread > @ 0x2ac99943547d (unknown) > make[4]: Leaving directory `/mesos/mesos-0.29.0/_build/src' > make[4]: *** [check-local] Aborted > make[3]: *** [check-am] Error 2 > make[3]: Leaving directory `/mesos/mesos-0.29.0/_build/src' > make[2]: *** [check] Error 2 > make[2]: Leaving directory `/mesos/mesos-0.29.0/_build/src' > make[1]: *** [check-recursive] Error 1 > make[1]: Leaving directory `/mesos/mesos-0.29.0/_build' > make: *** [distcheck] Error 1 > Build step 'Execute shell' marked build as failure > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5020) Drop `404 Not Found` and `307 Temporary Redirect` in the scheduler library.
[ https://issues.apache.org/jira/browse/MESOS-5020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15211124#comment-15211124 ] Yong Tang commented on MESOS-5020: -- Hi [~anandmazumdar] [~vinodkone], I created a review request: https://reviews.apache.org/r/45327/ Please take a look if you have time and let me know if there are any issues. > Drop `404 Not Found` and `307 Temporary Redirect` in the scheduler library. > --- > > Key: MESOS-5020 > URL: https://issues.apache.org/jira/browse/MESOS-5020 > Project: Mesos > Issue Type: Improvement > Reporter: Anand Mazumdar > Assignee: Yong Tang > Labels: mesosphere, newbie > > Currently, the scheduler library does not drop {{404 Not Found}} responses but treats > them as {{Event::ERROR}}. The library can receive this if the master has not > yet set up its HTTP routes. The executor library already deals with this. > Secondly, in some cases, the {{detector}} can detect a new master without the > master realizing that it has been elected as the new master. In such cases, > the master responds with {{307 Temporary Redirect}}. We would like to drop > these status codes as well, instead of treating them as {{Event::ERROR}}. > https://github.com/apache/mesos/blob/master/src/scheduler/scheduler.cpp#L547
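The requested handling can be sketched as follows. This is a hypothetical Python sketch of the decision logic, not the actual scheduler library (which is C++); the function and return values are illustrative.

```python
# 404 (master's HTTP routes not yet set up) and 307 (master not yet
# aware it has been elected leader) are transient conditions, so the
# library should drop the response and let the scheduler retry rather
# than surface Event::ERROR.
TRANSIENT = {404, 307}

def handle_response(status):
    if status in TRANSIENT:
        return "drop"      # transient master state; safe to retry later
    if status == 202:
        return "accepted"  # call accepted by the master
    return "error"         # genuine failure: surface Event::ERROR
```

Dropping rather than erroring matters because {{Event::ERROR}} is typically terminal for a scheduler, while both conditions above resolve on their own once the master finishes initializing or learns of its election.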
[jira] [Assigned] (MESOS-5020) Drop `404 Not Found` and `307 Temporary Redirect` in the scheduler library.
[ https://issues.apache.org/jira/browse/MESOS-5020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yong Tang reassigned MESOS-5020: Assignee: Yong Tang > Drop `404 Not Found` and `307 Temporary Redirect` in the scheduler library. > --- > > Key: MESOS-5020 > URL: https://issues.apache.org/jira/browse/MESOS-5020 > Project: Mesos > Issue Type: Improvement > Reporter: Anand Mazumdar > Assignee: Yong Tang > Labels: mesosphere, newbie > > Currently, the scheduler library does not drop {{404 Not Found}} responses but treats > them as {{Event::ERROR}}. The library can receive this if the master has not > yet set up its HTTP routes. The executor library already deals with this. > Secondly, in some cases, the {{detector}} can detect a new master without the > master realizing that it has been elected as the new master. In such cases, > the master responds with {{307 Temporary Redirect}}. We would like to drop > these status codes as well, instead of treating them as {{Event::ERROR}}. > https://github.com/apache/mesos/blob/master/src/scheduler/scheduler.cpp#L547
[jira] [Commented] (MESOS-4759) Add network/cni isolator for Mesos containerizer.
[ https://issues.apache.org/jira/browse/MESOS-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1520#comment-1520 ] Jie Yu commented on MESOS-4759: --- commit 4bf5833e58324df65e48f607a0d2b73b56f23f40 Author: Qian Zhang Date: Thu Mar 24 16:09:26 2016 -0700 Implemented prepare() method of "network/cni" isolator. Review: https://reviews.apache.org/r/44514/ > Add network/cni isolator for Mesos containerizer. > - > > Key: MESOS-4759 > URL: https://issues.apache.org/jira/browse/MESOS-4759 > Project: Mesos > Issue Type: Task > Reporter: Jie Yu > Assignee: Qian Zhang > > See the design doc for more context (MESOS-4742). > The isolator will interact with CNI plugins to create the network for the > container to join.
[jira] [Commented] (MESOS-3573) Mesos does not kill orphaned docker containers
[ https://issues.apache.org/jira/browse/MESOS-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15211051#comment-15211051 ] Ian Babrou commented on MESOS-3573: --- It shouldn't be, but it is. > The agent would itself kill the container the executor is running in after 2 > seconds (EXECUTOR_REREGISTER_TIMEOUT) Not always, just like I showed. > Of course, if the docker daemon is still stuck and the agent is not able to > invoke docker->stop on the container, it would fail. Docker is healthy, "docker stop" does not happen: https://issues.apache.org/jira/browse/MESOS-3573?focusedCommentId=15075015&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15075015 > Mesos does not kill orphaned docker containers > -- > > Key: MESOS-3573 > URL: https://issues.apache.org/jira/browse/MESOS-3573 > Project: Mesos > Issue Type: Bug > Components: docker, slave > Reporter: Ian Babrou > Assignee: Anand Mazumdar > Labels: mesosphere > > After upgrading to 0.24.0 we noticed hanging containers appearing. Looks like > there were changes between 0.23.0 and 0.24.0 that broke cleanup. > Here's how to trigger this bug: > 1. Deploy app in docker container. > 2. Kill corresponding mesos-docker-executor process > 3. 
Observe hanging container > Here are the logs after kill: > {noformat} > slave_1| I1002 12:12:59.362002 7791 docker.cpp:1576] Executor for > container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8' has exited > slave_1| I1002 12:12:59.362284 7791 docker.cpp:1374] Destroying > container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8' > slave_1| I1002 12:12:59.363404 7791 docker.cpp:1478] Running docker stop > on container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8' > slave_1| I1002 12:12:59.363876 7791 slave.cpp:3399] Executor > 'sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c' of framework > 20150923-122130-2153451692-5050-1- terminated with signal Terminated > slave_1| I1002 12:12:59.367570 7791 slave.cpp:2696] Handling status > update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- from @0.0.0.0:0 > slave_1| I1002 12:12:59.367842 7791 slave.cpp:5094] Terminating task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c > slave_1| W1002 12:12:59.368484 7791 docker.cpp:986] Ignoring updating > unknown container: f083aaa2-d5c3-43c1-b6ba-342de8829fa8 > slave_1| I1002 12:12:59.368671 7791 status_update_manager.cpp:322] > Received status update TASK_FAILED (UUID: > 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- > slave_1| I1002 12:12:59.368741 7791 status_update_manager.cpp:826] > Checkpointing UPDATE for status update TASK_FAILED (UUID: > 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- > slave_1| I1002 12:12:59.370636 7791 status_update_manager.cpp:376] > Forwarding update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) > for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- to the slave > slave_1| I1002 12:12:59.371335 7791 slave.cpp:2975] 
Forwarding the > update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- to master@172.16.91.128:5050 > slave_1| I1002 12:12:59.371908 7791 slave.cpp:2899] Status update > manager successfully handled status update TASK_FAILED (UUID: > 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- > master_1 | I1002 12:12:59.37204711 master.cpp:4069] Status update > TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- from slave > 20151002-120829-2153451692-5050-1-S0 at slave(1)@172.16.91.128:5051 > (172.16.91.128) > master_1 | I1002 12:12:59.37253411 master.cpp:4108] Forwarding status > update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- > master_1 | I1002 12:12:59.37301811 master.cpp:5576] Updating the latest > state of task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- to TASK_FAILED > master_1 | I1002 12:12:59.37344711 hierarchical.hpp:814] Recovered > cpus(*):0.1; mem(*):16; ports(*):[31685-31685] (total:
[jira] [Updated] (MESOS-5018) FrameworkInfo Capability enum does not support upgrades.
[ https://issues.apache.org/jira/browse/MESOS-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-5018: --- Fix Version/s: 0.27.3 0.28.1 > FrameworkInfo Capability enum does not support upgrades. > > > Key: MESOS-5018 > URL: https://issues.apache.org/jira/browse/MESOS-5018 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.23.0, 0.23.1, 0.24.0, 0.24.1, 0.25.0, 0.26.0, 0.27.0, > 0.27.1, 0.28.0, 0.27.2, 0.26.1, 0.25.1 >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler > Fix For: 0.29.0, 0.28.1, 0.27.3 > > > See MESOS-4997 for the general issue around enum usage. This ticket tracks > fixing the FrameworkInfo Capability enum to support upgrades in a backwards > compatible way.
[jira] [Updated] (MESOS-5029) Add labels to ExecutorInfo
[ https://issues.apache.org/jira/browse/MESOS-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-5029: - Shepherd: Benjamin Mahler > Add labels to ExecutorInfo > -- > > Key: MESOS-5029 > URL: https://issues.apache.org/jira/browse/MESOS-5029 > Project: Mesos > Issue Type: Improvement > Reporter: Zhitao Li > Assignee: Zhitao Li > Priority: Minor > > We want to allow frameworks to populate metadata on the ExecutorInfo object. > A use case would be custom labels inspected by a QosController.
[jira] [Updated] (MESOS-5030) Expose TaskInfo's metadata to ResourceUsage struct
[ https://issues.apache.org/jira/browse/MESOS-5030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-5030: - Shepherd: Benjamin Mahler > Expose TaskInfo's metadata to ResourceUsage struct > -- > > Key: MESOS-5030 > URL: https://issues.apache.org/jira/browse/MESOS-5030 > Project: Mesos > Issue Type: Improvement > Components: oversubscription > Reporter: Zhitao Li > Assignee: Zhitao Li > Labels: qos > > So the QosController could use metadata information from TaskInfo. > Based on conversations in the Mesos working group, we would at least include: > - task id; > - name; > - labels; > (I think resources and kill_policy should probably also be included.) > An alternative would be to just purge fields like `data`.
[jira] [Commented] (MESOS-5014) Call and Event Type enums in scheduler.proto should be optional
[ https://issues.apache.org/jira/browse/MESOS-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210974#comment-15210974 ] Yong Tang commented on MESOS-5014: -- Hi [~vinodkone], I added a review request: https://reviews.apache.org/r/45317/ Let me know if there are any issues. > Call and Event Type enums in scheduler.proto should be optional > --- > > Key: MESOS-5014 > URL: https://issues.apache.org/jira/browse/MESOS-5014 > Project: Mesos > Issue Type: Improvement > Reporter: Vinod Kone > Assignee: Yong Tang > > Having a 'required' Type enum has backwards compatibility issues when adding > new enum types. See MESOS-4997 for details.
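The compatibility problem can be modeled in plain Python (illustrative only, not protobuf itself): in proto2, an enum value the parser does not recognize is handled like an unknown field, so if the enum field is required, the message appears to be missing a required field and fails to parse; if the field is optional, the message still parses with the field simply unset. The enum values below are a subset chosen for illustration.

```python
# The "old binary": it only knows the Call types that existed when it
# was compiled.
KNOWN_TYPES = {"SUBSCRIBE", "TEARDOWN", "ACCEPT", "DECLINE"}

def parse_type(value, required):
    if value in KNOWN_TYPES:
        return value
    if required:
        # Unknown enum values are treated as unknown fields, so the
        # required field looks missing and the whole message is rejected.
        raise ValueError("required enum field missing")
    # Optional field: the message still parses; the field is just unset.
    return None
```

This is why adding a new Call/Event type to a `required` enum breaks rolling upgrades: an upgraded sender can emit a value that makes every message unparseable by older receivers.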
[jira] [Assigned] (MESOS-5014) Call and Event Type enums in scheduler.proto should be optional
[ https://issues.apache.org/jira/browse/MESOS-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yong Tang reassigned MESOS-5014: Assignee: Yong Tang > Call and Event Type enums in scheduler.proto should be optional > --- > > Key: MESOS-5014 > URL: https://issues.apache.org/jira/browse/MESOS-5014 > Project: Mesos > Issue Type: Improvement >Reporter: Vinod Kone >Assignee: Yong Tang > > Having a 'required' Type enum has backwards compatibility issues when adding > new enum types. See MESOS-4997 for details.
[jira] [Created] (MESOS-5030) Expose TaskInfo's metadata to ResourceUsage struct
Zhitao Li created MESOS-5030: Summary: Expose TaskInfo's metadata to ResourceUsage struct Key: MESOS-5030 URL: https://issues.apache.org/jira/browse/MESOS-5030 Project: Mesos Issue Type: Improvement Components: oversubscription Reporter: Zhitao Li Assignee: Zhitao Li So the QosController could use metadata information from TaskInfo. Based on conversations in the Mesos working group, we would at least include: - task id; - name; - labels; (I think resources and kill_policy should probably also be included.) An alternative would be to just purge fields like `data`.
[jira] [Created] (MESOS-5029) Add labels to ExecutorInfo
Zhitao Li created MESOS-5029: Summary: Add labels to ExecutorInfo Key: MESOS-5029 URL: https://issues.apache.org/jira/browse/MESOS-5029 Project: Mesos Issue Type: Improvement Reporter: Zhitao Li Assignee: Zhitao Li Priority: Minor We want to allow frameworks to populate metadata on the ExecutorInfo object. A use case would be custom labels inspected by a QosController.
[jira] [Commented] (MESOS-3573) Mesos does not kill orphaned docker containers
[ https://issues.apache.org/jira/browse/MESOS-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210909#comment-15210909 ] Anand Mazumdar commented on MESOS-3573: --- [~bobrik] This shouldn't be a problem. The transient error that you are linking to happens due to this: When the agent is recovering, it tries to send a {{ReconnectExecutorMessage}} to reconnect with the executor. If, for some reason, that fails, as in your logs, probably because the executor process is hung or has already exited, the agent would itself kill the container the executor is running in after 2 seconds (EXECUTOR_REREGISTER_TIMEOUT): https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L4700 Of course, if the docker daemon is still stuck and the agent is not able to invoke {{docker->stop}} on the container, it would fail. We cannot do anything about that, as remarked in point 1 of my earlier comment. Let me know if you have any further queries. > Mesos does not kill orphaned docker containers > -- > > Key: MESOS-3573 > URL: https://issues.apache.org/jira/browse/MESOS-3573 > Project: Mesos > Issue Type: Bug > Components: docker, slave > Reporter: Ian Babrou > Assignee: Anand Mazumdar > Labels: mesosphere > > After upgrading to 0.24.0 we noticed hanging containers appearing. Looks like > there were changes between 0.23.0 and 0.24.0 that broke cleanup. > Here's how to trigger this bug: > 1. Deploy app in docker container. > 2. Kill corresponding mesos-docker-executor process > 3. 
Observe hanging container > Here are the logs after kill: > {noformat} > slave_1| I1002 12:12:59.362002 7791 docker.cpp:1576] Executor for > container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8' has exited > slave_1| I1002 12:12:59.362284 7791 docker.cpp:1374] Destroying > container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8' > slave_1| I1002 12:12:59.363404 7791 docker.cpp:1478] Running docker stop > on container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8' > slave_1| I1002 12:12:59.363876 7791 slave.cpp:3399] Executor > 'sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c' of framework > 20150923-122130-2153451692-5050-1- terminated with signal Terminated > slave_1| I1002 12:12:59.367570 7791 slave.cpp:2696] Handling status > update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- from @0.0.0.0:0 > slave_1| I1002 12:12:59.367842 7791 slave.cpp:5094] Terminating task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c > slave_1| W1002 12:12:59.368484 7791 docker.cpp:986] Ignoring updating > unknown container: f083aaa2-d5c3-43c1-b6ba-342de8829fa8 > slave_1| I1002 12:12:59.368671 7791 status_update_manager.cpp:322] > Received status update TASK_FAILED (UUID: > 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- > slave_1| I1002 12:12:59.368741 7791 status_update_manager.cpp:826] > Checkpointing UPDATE for status update TASK_FAILED (UUID: > 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- > slave_1| I1002 12:12:59.370636 7791 status_update_manager.cpp:376] > Forwarding update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) > for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- to the slave > slave_1| I1002 12:12:59.371335 7791 slave.cpp:2975] 
Forwarding the > update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- to master@172.16.91.128:5050 > slave_1| I1002 12:12:59.371908 7791 slave.cpp:2899] Status update > manager successfully handled status update TASK_FAILED (UUID: > 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- > master_1 | I1002 12:12:59.37204711 master.cpp:4069] Status update > TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- from slave > 20151002-120829-2153451692-5050-1-S0 at slave(1)@172.16.91.128:5051 > (172.16.91.128) > master_1 | I1002 12:12:59.37253411 master.cpp:4108] Forwarding status > update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- > master_1 | I1002 12:12:59.37301811 master.cpp:5576]
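The recovery flow described in the comment above (send {{ReconnectExecutorMessage}}, then destroy the containers of executors that fail to re-register within EXECUTOR_REREGISTER_TIMEOUT) can be modeled as follows. This is an illustrative Python sketch, not the agent's C++ code; the 2-second value is taken from the comment.

```python
# Illustrative model of the agent-recovery behavior described above (a sketch,
# not Mesos source): executors that fail to re-register within
# EXECUTOR_REREGISTER_TIMEOUT have their containers destroyed.
EXECUTOR_REREGISTER_TIMEOUT = 2.0  # seconds, per the comment above

def executors_to_kill(executors, now):
    """Return executors whose containers the agent would destroy."""
    return [e for e in executors
            if not e["reregistered"]
            and now - e["reconnect_sent_at"] >= EXECUTOR_REREGISTER_TIMEOUT]

executors = [
    {"id": "healthy", "reregistered": True,  "reconnect_sent_at": 0.0},
    {"id": "hung",    "reregistered": False, "reconnect_sent_at": 0.0},
]
print([e["id"] for e in executors_to_kill(executors, now=2.5)])  # ['hung']
```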
[jira] [Updated] (MESOS-5028) Copy provisioner does not work for docker image layers with dangling symlink
[ https://issues.apache.org/jira/browse/MESOS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-5028: -- Assignee: Gilbert Song (was: Jie Yu) > Copy provisioner does not work for docker image layers with dangling symlink > > > Key: MESOS-5028 > URL: https://issues.apache.org/jira/browse/MESOS-5028 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Zhitao Li >Assignee: Gilbert Song > > I'm trying to play with the new image provisioner on our custom docker > images, but one of layer failed to get copied, possibly due to a dangling > symlink. > Error log with Glog_v=1: > {quote} > I0324 05:42:48.926678 15067 copy.cpp:127] Copying layer path > '/tmp/mesos/store/docker/layers/5df0888641196b88dcc1b97d04c74839f02a73b8a194a79e134426d6a8fcb0f1/rootfs' > to rootfs > '/var/lib/mesos/provisioner/containers/5f05be6c-c970-4539-aa64-fd0eef2ec7ae/backends/copy/rootfses/507173f3-e316-48a3-a96e-5fdea9ffe9f6' > E0324 05:42:49.028506 15062 slave.cpp:3773] Container > '5f05be6c-c970-4539-aa64-fd0eef2ec7ae' for executor 'test' of framework > 75932a89-1514-4011-bafe-beb6a208bb2d-0004 failed to start: Collect failed: > Collect failed: Failed to copy layer: cp: cannot overwrite directory > ‘/var/lib/mesos/provisioner/containers/5f05be6c-c970-4539-aa64-fd0eef2ec7ae/backends/copy/rootfses/507173f3-e316-48a3-a96e-5fdea9ffe9f6/etc/apt’ > with non-directory > {quote} > Content of > _/tmp/mesos/store/docker/layers/5df0888641196b88dcc1b97d04c74839f02a73b8a194a79e134426d6a8fcb0f1/rootfs/etc/apt_ > points to a non-existing absolute path (cannot provide exact path but it's a > result of us trying to mount apt keys into docker container at build time). > I believe what happened is that we executed a script at build time, which > contains equivalent of: > {quote} > rm -rf /etc/apt/* && ln -sf /build-mount-point/ /etc/apt > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5028) Copy provisioner does not work for docker image layers with dangling symlink
[ https://issues.apache.org/jira/browse/MESOS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-5028: -- Shepherd: Jie Yu > Copy provisioner does not work for docker image layers with dangling symlink > > > Key: MESOS-5028 > URL: https://issues.apache.org/jira/browse/MESOS-5028 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Zhitao Li >Assignee: Gilbert Song > > I'm trying to play with the new image provisioner on our custom docker > images, but one of layer failed to get copied, possibly due to a dangling > symlink. > Error log with Glog_v=1: > {quote} > I0324 05:42:48.926678 15067 copy.cpp:127] Copying layer path > '/tmp/mesos/store/docker/layers/5df0888641196b88dcc1b97d04c74839f02a73b8a194a79e134426d6a8fcb0f1/rootfs' > to rootfs > '/var/lib/mesos/provisioner/containers/5f05be6c-c970-4539-aa64-fd0eef2ec7ae/backends/copy/rootfses/507173f3-e316-48a3-a96e-5fdea9ffe9f6' > E0324 05:42:49.028506 15062 slave.cpp:3773] Container > '5f05be6c-c970-4539-aa64-fd0eef2ec7ae' for executor 'test' of framework > 75932a89-1514-4011-bafe-beb6a208bb2d-0004 failed to start: Collect failed: > Collect failed: Failed to copy layer: cp: cannot overwrite directory > ‘/var/lib/mesos/provisioner/containers/5f05be6c-c970-4539-aa64-fd0eef2ec7ae/backends/copy/rootfses/507173f3-e316-48a3-a96e-5fdea9ffe9f6/etc/apt’ > with non-directory > {quote} > Content of > _/tmp/mesos/store/docker/layers/5df0888641196b88dcc1b97d04c74839f02a73b8a194a79e134426d6a8fcb0f1/rootfs/etc/apt_ > points to a non-existing absolute path (cannot provide exact path but it's a > result of us trying to mount apt keys into docker container at build time). > I believe what happened is that we executed a script at build time, which > contains equivalent of: > {quote} > rm -rf /etc/apt/* && ln -sf /build-mount-point/ /etc/apt > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5028) Copy provisioner does not work for docker image layers with dangling symlink
Zhitao Li created MESOS-5028: Summary: Copy provisioner does not work for docker image layers with dangling symlink Key: MESOS-5028 URL: https://issues.apache.org/jira/browse/MESOS-5028 Project: Mesos Issue Type: Bug Components: containerization Reporter: Zhitao Li Assignee: Jie Yu I'm trying to play with the new image provisioner on our custom docker images, but one of layer failed to get copied, possibly due to a dangling symlink. Error log with Glog_v=1: {quote} I0324 05:42:48.926678 15067 copy.cpp:127] Copying layer path '/tmp/mesos/store/docker/layers/5df0888641196b88dcc1b97d04c74839f02a73b8a194a79e134426d6a8fcb0f1/rootfs' to rootfs '/var/lib/mesos/provisioner/containers/5f05be6c-c970-4539-aa64-fd0eef2ec7ae/backends/copy/rootfses/507173f3-e316-48a3-a96e-5fdea9ffe9f6' E0324 05:42:49.028506 15062 slave.cpp:3773] Container '5f05be6c-c970-4539-aa64-fd0eef2ec7ae' for executor 'test' of framework 75932a89-1514-4011-bafe-beb6a208bb2d-0004 failed to start: Collect failed: Collect failed: Failed to copy layer: cp: cannot overwrite directory ‘/var/lib/mesos/provisioner/containers/5f05be6c-c970-4539-aa64-fd0eef2ec7ae/backends/copy/rootfses/507173f3-e316-48a3-a96e-5fdea9ffe9f6/etc/apt’ with non-directory {quote} Content of _/tmp/mesos/store/docker/layers/5df0888641196b88dcc1b97d04c74839f02a73b8a194a79e134426d6a8fcb0f1/rootfs/etc/apt_ points to a non-existing absolute path (cannot provide exact path but it's a result of us trying to mount apt keys into docker container at build time). I believe what happened is that we executed a script at build time, which contains equivalent of: {quote} rm -rf /etc/apt/* && ln -sf /build-mount-point/ /etc/apt {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
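The failure mode in the report above can be reproduced outside Mesos: a layer that replaced a directory with a dangling symlink cannot be copied onto a rootfs where that path is still a real directory. Below is a minimal Python analogue (illustrative only; the Mesos copy backend actually shells out to `cp`, which fails the same way with "cannot overwrite directory ... with non-directory").

```python
# Minimal Python analogue of the failure above (illustrative; not the Mesos
# copy backend): a layer that replaced /etc/apt with a dangling symlink
# cannot be copied over a rootfs where /etc/apt is still a real directory.
import os
import shutil
import tempfile

layer = tempfile.mkdtemp()   # stands in for the extracted image layer
rootfs = tempfile.mkdtemp()  # stands in for the provisioned rootfs

os.makedirs(os.path.join(rootfs, "etc/apt"))  # lower layers: a real directory
os.makedirs(os.path.join(layer, "etc"))
# The build-time script left a symlink to a path that no longer exists:
os.symlink("/build-mount-point/", os.path.join(layer, "etc/apt"))

try:
    shutil.copytree(layer, rootfs, symlinks=True, dirs_exist_ok=True)
    outcome = "copied"
except shutil.Error:
    outcome = "copy failed"  # cannot overwrite a directory with a symlink
print(outcome)  # copy failed
```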
[jira] [Commented] (MESOS-3573) Mesos does not kill orphaned docker containers
[ https://issues.apache.org/jira/browse/MESOS-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210830#comment-15210830 ] Ian Babrou commented on MESOS-3573: --- Cleanup is not always successful, there might be transient errors: https://issues.apache.org/jira/browse/MESOS-3573?focusedCommentId=15121914=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15121914 > Mesos does not kill orphaned docker containers > -- > > Key: MESOS-3573 > URL: https://issues.apache.org/jira/browse/MESOS-3573 > Project: Mesos > Issue Type: Bug > Components: docker, slave >Reporter: Ian Babrou >Assignee: Anand Mazumdar > Labels: mesosphere > > After upgrade to 0.24.0 we noticed hanging containers appearing. Looks like > there were changes between 0.23.0 and 0.24.0 that broke cleanup. > Here's how to trigger this bug: > 1. Deploy app in docker container. > 2. Kill corresponding mesos-docker-executor process > 3. Observe hanging container > Here are the logs after kill: > {noformat} > slave_1| I1002 12:12:59.362002 7791 docker.cpp:1576] Executor for > container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8' has exited > slave_1| I1002 12:12:59.362284 7791 docker.cpp:1374] Destroying > container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8' > slave_1| I1002 12:12:59.363404 7791 docker.cpp:1478] Running docker stop > on container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8' > slave_1| I1002 12:12:59.363876 7791 slave.cpp:3399] Executor > 'sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c' of framework > 20150923-122130-2153451692-5050-1- terminated with signal Terminated > slave_1| I1002 12:12:59.367570 7791 slave.cpp:2696] Handling status > update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- from @0.0.0.0:0 > slave_1| I1002 12:12:59.367842 7791 slave.cpp:5094] Terminating task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c > slave_1| W1002 
12:12:59.368484 7791 docker.cpp:986] Ignoring updating > unknown container: f083aaa2-d5c3-43c1-b6ba-342de8829fa8 > slave_1| I1002 12:12:59.368671 7791 status_update_manager.cpp:322] > Received status update TASK_FAILED (UUID: > 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- > slave_1| I1002 12:12:59.368741 7791 status_update_manager.cpp:826] > Checkpointing UPDATE for status update TASK_FAILED (UUID: > 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- > slave_1| I1002 12:12:59.370636 7791 status_update_manager.cpp:376] > Forwarding update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) > for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- to the slave > slave_1| I1002 12:12:59.371335 7791 slave.cpp:2975] Forwarding the > update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- to master@172.16.91.128:5050 > slave_1| I1002 12:12:59.371908 7791 slave.cpp:2899] Status update > manager successfully handled status update TASK_FAILED (UUID: > 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- > master_1 | I1002 12:12:59.37204711 master.cpp:4069] Status update > TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- from slave > 20151002-120829-2153451692-5050-1-S0 at slave(1)@172.16.91.128:5051 > (172.16.91.128) > master_1 | I1002 12:12:59.37253411 master.cpp:4108] Forwarding status > update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 
20150923-122130-2153451692-5050-1- > master_1 | I1002 12:12:59.37301811 master.cpp:5576] Updating the latest > state of task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- to TASK_FAILED > master_1 | I1002 12:12:59.37344711 hierarchical.hpp:814] Recovered > cpus(*):0.1; mem(*):16; ports(*):[31685-31685] (total: cpus(*):4; > mem(*):1001; disk(*):52869; ports(*):[31000-32000], allocated: > cpus(*):8.32667e-17) on slave 20151002-120829-2153451692-5050-1-S0 from > framework 20150923-122130-2153451692-5050-1- > {noformat} > Another issue: if you restart mesos-slave on the host with orphaned docker > containers,
[jira] [Commented] (MESOS-3573) Mesos does not kill orphaned docker containers
[ https://issues.apache.org/jira/browse/MESOS-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210821#comment-15210821 ] Anand Mazumdar commented on MESOS-3573: --- There are a few things at play here. A container can be orphaned due to any of the following: 1. The agent sends an explicit {{Shutdown}} request to the docker executor. The docker executor in turn executes a {{docker->stop}} which hangs due to some issue with the docker daemon itself. We currently just kill the {{mesos-docker-executor}} process if it's a command executor, or the executor process if it's a custom executor. The associated container is now an orphan. We would kill all such orphaned containers when the agent process starts up during the recovery phase. We can't do anything more than that for this scenario. 2. A container can also be orphaned due to the following: - The agent gets partitioned off from the master. The master sends it an explicit {{Shutdown}} request that instructs the agent to commit suicide after killing all its tasks. Some of the containers can now be orphans due to 1. The agent process then starts off as a fresh instance with a new {{SlaveID}}. - The agent process is accidentally started with a new {{SlaveID}} due to specifying an incorrect {{work_dir}}. In both cases under 2., all the existing containers would be treated as orphans, since, as of now, we only invoke {{docker ps}} for containers that match the current {{SlaveID}}. We should ideally be killing all other containers that don't match the current {{SlaveID}} but have the {{mesos-}} prefix. > Mesos does not kill orphaned docker containers > -- > > Key: MESOS-3573 > URL: https://issues.apache.org/jira/browse/MESOS-3573 > Project: Mesos > Issue Type: Bug > Components: docker, slave >Reporter: Ian Babrou >Assignee: Anand Mazumdar > Labels: mesosphere > > After upgrade to 0.24.0 we noticed hanging containers appearing.
Looks like > there were changes between 0.23.0 and 0.24.0 that broke cleanup. > Here's how to trigger this bug: > 1. Deploy app in docker container. > 2. Kill corresponding mesos-docker-executor process > 3. Observe hanging container > Here are the logs after kill: > {noformat} > slave_1| I1002 12:12:59.362002 7791 docker.cpp:1576] Executor for > container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8' has exited > slave_1| I1002 12:12:59.362284 7791 docker.cpp:1374] Destroying > container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8' > slave_1| I1002 12:12:59.363404 7791 docker.cpp:1478] Running docker stop > on container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8' > slave_1| I1002 12:12:59.363876 7791 slave.cpp:3399] Executor > 'sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c' of framework > 20150923-122130-2153451692-5050-1- terminated with signal Terminated > slave_1| I1002 12:12:59.367570 7791 slave.cpp:2696] Handling status > update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- from @0.0.0.0:0 > slave_1| I1002 12:12:59.367842 7791 slave.cpp:5094] Terminating task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c > slave_1| W1002 12:12:59.368484 7791 docker.cpp:986] Ignoring updating > unknown container: f083aaa2-d5c3-43c1-b6ba-342de8829fa8 > slave_1| I1002 12:12:59.368671 7791 status_update_manager.cpp:322] > Received status update TASK_FAILED (UUID: > 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- > slave_1| I1002 12:12:59.368741 7791 status_update_manager.cpp:826] > Checkpointing UPDATE for status update TASK_FAILED (UUID: > 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- > slave_1| I1002 12:12:59.370636 7791 status_update_manager.cpp:376] > Forwarding update TASK_FAILED (UUID: 
4a1b2387-a469-4f01-bfcb-0d1cccbde550) > for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- to the slave > slave_1| I1002 12:12:59.371335 7791 slave.cpp:2975] Forwarding the > update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- to master@172.16.91.128:5050 > slave_1| I1002 12:12:59.371908 7791 slave.cpp:2899] Status update > manager successfully handled status update TASK_FAILED (UUID: > 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task > sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework > 20150923-122130-2153451692-5050-1- >
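The cleanup proposed in point 2. of the comment above amounts to a filter over container names: anything carrying the {{mesos-}} prefix but a {{SlaveID}} other than the current one is an orphan. A hypothetical sketch (not Mesos source; the container names are illustrative):

```python
# Hypothetical sketch of the proposed orphan cleanup (not Mesos source): any
# container with the `mesos-` prefix but a stale SlaveID is treated as an
# orphan to be killed. Names below are illustrative.
def orphaned_containers(container_names, current_slave_id):
    """Return names of containers left behind by a previous SlaveID."""
    return [name for name in container_names
            if name.startswith("mesos-") and current_slave_id not in name]

names = [
    "mesos-20151002-120829-2153451692-5050-1-S0.f083aaa2",  # current agent
    "mesos-20150901-000000-0000000000-5050-1-S3.deadbeef",  # stale SlaveID
    "unrelated-container",                                  # not ours
]
print(orphaned_containers(names, "20151002-120829-2153451692-5050-1-S0"))
# ['mesos-20150901-000000-0000000000-5050-1-S3.deadbeef']
```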
[jira] [Created] (MESOS-5027) Enable authenticated login in the webui
Greg Mann created MESOS-5027: Summary: Enable authenticated login in the webui Key: MESOS-5027 URL: https://issues.apache.org/jira/browse/MESOS-5027 Project: Mesos Issue Type: Improvement Components: master, security, webui Reporter: Greg Mann The webui hits a number of endpoints to get the data that it displays: {{/state}}, {{/metrics/snapshot}}, {{/files/browse}}, {{/files/read}}, and maybe others? Once authentication is enabled on these endpoints, we need to add a login prompt to the webui so that users can provide credentials. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
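Whatever form the login prompt takes, the webui would need to attach the user's credentials to every request it makes to those endpoints, e.g. as an HTTP basic-auth header. A minimal sketch of that plumbing (the principal/secret values are placeholders):

```python
# Sketch of the credential plumbing the webui would need once the endpoints
# above require authentication: attach an HTTP basic-auth header to each
# request. The principal/secret values are placeholders.
import base64

def basic_auth_header(principal, secret):
    """Build the Authorization header for HTTP basic authentication."""
    token = base64.b64encode(("%s:%s" % (principal, secret)).encode()).decode()
    return {"Authorization": "Basic " + token}

print(basic_auth_header("webui", "secret"))
# {'Authorization': 'Basic d2VidWk6c2VjcmV0'}
```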
[jira] [Updated] (MESOS-4951) Enable actors to pass an authentication realm to libprocess
[ https://issues.apache.org/jira/browse/MESOS-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-4951: - Assignee: (was: Greg Mann) > Enable actors to pass an authentication realm to libprocess > --- > > Key: MESOS-4951 > URL: https://issues.apache.org/jira/browse/MESOS-4951 > Project: Mesos > Issue Type: Improvement > Components: libprocess, slave >Reporter: Greg Mann > Labels: authentication, http, mesosphere, security > > To prepare for MESOS-4902, the Mesos master and agent need a way to pass the > desired authentication realm to libprocess. Since some endpoints (like > {{/profiler/*}}) get installed in libprocess, the master/agent should be able > to specify during initialization what authentication realm the > libprocess-level endpoints will be authenticated under. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5025) FetcherCacheTest.LocalUncached is flaky
[ https://issues.apache.org/jira/browse/MESOS-5025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-5025: -- Affects Version/s: 0.29.0 > FetcherCacheTest.LocalUncached is flaky > --- > > Key: MESOS-5025 > URL: https://issues.apache.org/jira/browse/MESOS-5025 > Project: Mesos > Issue Type: Bug > Components: fetcher >Affects Versions: 0.29.0 > Environment: CentOS 7 >Reporter: Anand Mazumdar > Labels: flaky, flaky-test > > Showed up on an internal CI: > {code} > [17:57:05] : [Step 11/11] [ RUN ] FetcherCacheTest.LocalUncached > [17:57:05]W: [Step 11/11] I0324 17:57:05.653718 1813 cluster.cpp:139] > Creating default 'local' authorizer > [17:57:05]W: [Step 11/11] I0324 17:57:05.659001 1813 leveldb.cpp:174] > Opened db in 5.09329ms > [17:57:05]W: [Step 11/11] I0324 17:57:05.660393 1813 leveldb.cpp:181] > Compacted db in 1.367077ms > [17:57:05]W: [Step 11/11] I0324 17:57:05.660434 1813 leveldb.cpp:196] > Created db iterator in 13516ns > [17:57:05]W: [Step 11/11] I0324 17:57:05.660446 1813 leveldb.cpp:202] > Seeked to beginning of db in 1531ns > [17:57:05]W: [Step 11/11] I0324 17:57:05.660454 1813 leveldb.cpp:271] > Iterated through 0 keys in the db in 284ns > [17:57:05]W: [Step 11/11] I0324 17:57:05.660478 1813 replica.cpp:779] > Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned > [17:57:05]W: [Step 11/11] I0324 17:57:05.660815 1831 recover.cpp:447] > Starting replica recovery > [17:57:05]W: [Step 11/11] I0324 17:57:05.661001 1831 recover.cpp:473] > Replica is in EMPTY status > [17:57:05]W: [Step 11/11] I0324 17:57:05.661866 1830 replica.cpp:673] > Replica in EMPTY status received a broadcasted recover request from > (1886)@172.30.2.131:51675 > [17:57:05]W: [Step 11/11] I0324 17:57:05.662237 1831 recover.cpp:193] > Received a recover response from a replica in EMPTY status > [17:57:05]W: [Step 11/11] I0324 17:57:05.662652 1827 recover.cpp:564] > Updating replica status to STARTING > [17:57:05]W: [Step 
11/11] I0324 17:57:05.663151 1829 master.cpp:376] > Master 2574ed73-b254-4829-9efc-f76d89150396 (ip-172-30-2-131.mesosphere.io) > started on 172.30.2.131:51675 > [17:57:05]W: [Step 11/11] I0324 17:57:05.663172 1829 master.cpp:378] Flags > at startup: --acls="" --allocation_interval="1secs" > --allocator="HierarchicalDRF" --authenticate="true" > --authenticate_http="true" --authenticate_slaves="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/VRasc1/credentials" --framework_sorter="drf" > --help="false" --hostname_lookup="true" --http_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" > --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" > --quiet="false" --recovery_slave_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_store_timeout="100secs" --registry_strict="true" > --root_submissions="true" --slave_ping_timeout="15secs" > --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/VRasc1/master" > --zk_session_timeout="10secs" > [17:57:05]W: [Step 11/11] I0324 17:57:05.663374 1829 master.cpp:427] > Master only allowing authenticated frameworks to register > [17:57:05]W: [Step 11/11] I0324 17:57:05.663384 1829 master.cpp:432] > Master only allowing authenticated slaves to register > [17:57:05]W: [Step 11/11] I0324 17:57:05.663390 1829 credentials.hpp:35] > Loading credentials for authentication from '/tmp/VRasc1/credentials' > [17:57:05]W: [Step 11/11] I0324 17:57:05.663595 1829 master.cpp:474] Using > default 'crammd5' authenticator > [17:57:05]W: [Step 11/11] I0324 17:57:05.663725 1829 master.cpp:545] Using > default 'basic' HTTP authenticator > [17:57:05]W: [Step 11/11] I0324 17:57:05.663880 1829 master.cpp:583] > Authorization enabled > [17:57:05]W: [Step 11/11] I0324 
17:57:05.664001 1827 hierarchical.cpp:144] > Initialized hierarchical allocator process > [17:57:05]W: [Step 11/11] I0324 17:57:05.664010 1831 > whitelist_watcher.cpp:77] No whitelist given > [17:57:05]W: [Step 11/11] I0324 17:57:05.664114 1834 leveldb.cpp:304] > Persisting metadata (8 bytes) to leveldb took 1.271881ms > [17:57:05]W: [Step 11/11] I0324 17:57:05.664136 1834 replica.cpp:320] > Persisted replica status to STARTING > [17:57:05]W: [Step 11/11] I0324 17:57:05.664353 1833 recover.cpp:473] > Replica is in STARTING status > [17:57:05]W: [Step 11/11] I0324 17:57:05.665315 1833 replica.cpp:673] > Replica in STARTING status received a broadcasted recover request from >
[jira] [Commented] (MESOS-5015) Call and Event Type enums in executor.proto should be optional
[ https://issues.apache.org/jira/browse/MESOS-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210751#comment-15210751 ] Yong Tang commented on MESOS-5015: -- Hi [~vinodkone] Just created a review request: https://reviews.apache.org/r/45304/ Let me know if there are any issues. > Call and Event Type enums in executor.proto should be optional > -- > > Key: MESOS-5015 > URL: https://issues.apache.org/jira/browse/MESOS-5015 > Project: Mesos > Issue Type: Improvement >Reporter: Vinod Kone >Assignee: Yong Tang > > Having a 'required' Type enum has backwards compatibility issues when adding > new enum types. See MESOS-4997 for details. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
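The compatibility hazard behind this ticket can be illustrated without protobuf: a reader that treats the Type enum as required and closed must reject any message carrying a value it does not know, while an optional field with an UNKNOWN default lets old readers skip new call types gracefully. The value names below mirror executor.proto, but both decoders are purely hypothetical:

```python
# Conceptual illustration (plain Python, not protobuf): why a required,
# closed Type enum breaks backwards compatibility when new values are added.
KNOWN_TYPES = {0: "UNKNOWN", 1: "SUBSCRIBE", 2: "UPDATE", 3: "MESSAGE"}

def decode_required(type_value):
    # Old behavior: an unrecognized value invalidates the whole message.
    if type_value not in KNOWN_TYPES:
        raise ValueError("unknown required enum value: %d" % type_value)
    return KNOWN_TYPES[type_value]

def decode_optional(type_value):
    # Proposed behavior: unknown values degrade gracefully to UNKNOWN.
    return KNOWN_TYPES.get(type_value, "UNKNOWN")

print(decode_optional(7))  # UNKNOWN
```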
[jira] [Assigned] (MESOS-5015) Call and Event Type enums in executor.proto should be optional
[ https://issues.apache.org/jira/browse/MESOS-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yong Tang reassigned MESOS-5015: Assignee: Yong Tang > Call and Event Type enums in executor.proto should be optional > -- > > Key: MESOS-5015 > URL: https://issues.apache.org/jira/browse/MESOS-5015 > Project: Mesos > Issue Type: Improvement >Reporter: Vinod Kone >Assignee: Yong Tang > > Having a 'required' Type enum has backwards compatibility issues when adding > new enum types. See MESOS-4997 for details. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5025) FetcherCacheTest.LocalUncached is flaky
Anand Mazumdar created MESOS-5025: - Summary: FetcherCacheTest.LocalUncached is flaky Key: MESOS-5025 URL: https://issues.apache.org/jira/browse/MESOS-5025 Project: Mesos Issue Type: Bug Components: fetcher Environment: CentOS 7 Reporter: Anand Mazumdar Showed up on an internal CI: {code} [17:57:05] : [Step 11/11] [ RUN ] FetcherCacheTest.LocalUncached [17:57:05]W: [Step 11/11] I0324 17:57:05.653718 1813 cluster.cpp:139] Creating default 'local' authorizer [17:57:05]W: [Step 11/11] I0324 17:57:05.659001 1813 leveldb.cpp:174] Opened db in 5.09329ms [17:57:05]W: [Step 11/11] I0324 17:57:05.660393 1813 leveldb.cpp:181] Compacted db in 1.367077ms [17:57:05]W: [Step 11/11] I0324 17:57:05.660434 1813 leveldb.cpp:196] Created db iterator in 13516ns [17:57:05]W: [Step 11/11] I0324 17:57:05.660446 1813 leveldb.cpp:202] Seeked to beginning of db in 1531ns [17:57:05]W: [Step 11/11] I0324 17:57:05.660454 1813 leveldb.cpp:271] Iterated through 0 keys in the db in 284ns [17:57:05]W: [Step 11/11] I0324 17:57:05.660478 1813 replica.cpp:779] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned [17:57:05]W: [Step 11/11] I0324 17:57:05.660815 1831 recover.cpp:447] Starting replica recovery [17:57:05]W: [Step 11/11] I0324 17:57:05.661001 1831 recover.cpp:473] Replica is in EMPTY status [17:57:05]W: [Step 11/11] I0324 17:57:05.661866 1830 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (1886)@172.30.2.131:51675 [17:57:05]W: [Step 11/11] I0324 17:57:05.662237 1831 recover.cpp:193] Received a recover response from a replica in EMPTY status [17:57:05]W: [Step 11/11] I0324 17:57:05.662652 1827 recover.cpp:564] Updating replica status to STARTING [17:57:05]W: [Step 11/11] I0324 17:57:05.663151 1829 master.cpp:376] Master 2574ed73-b254-4829-9efc-f76d89150396 (ip-172-30-2-131.mesosphere.io) started on 172.30.2.131:51675 [17:57:05]W: [Step 11/11] I0324 17:57:05.663172 1829 master.cpp:378] Flags at startup: --acls="" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="true" --authenticate_http="true" --authenticate_slaves="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/VRasc1/credentials" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" --quiet="false" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="100secs" --registry_strict="true" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/VRasc1/master" --zk_session_timeout="10secs" [17:57:05]W: [Step 11/11] I0324 17:57:05.663374 1829 master.cpp:427] Master only allowing authenticated frameworks to register [17:57:05]W: [Step 11/11] I0324 17:57:05.663384 1829 master.cpp:432] Master only allowing authenticated slaves to register [17:57:05]W: [Step 11/11] I0324 17:57:05.663390 1829 credentials.hpp:35] Loading credentials for authentication from '/tmp/VRasc1/credentials' [17:57:05]W: [Step 11/11] I0324 17:57:05.663595 1829 master.cpp:474] Using default 'crammd5' authenticator [17:57:05]W: [Step 11/11] I0324 17:57:05.663725 1829 master.cpp:545] Using default 'basic' HTTP authenticator [17:57:05]W: [Step 11/11] I0324 17:57:05.663880 1829 master.cpp:583] Authorization enabled [17:57:05]W: [Step 11/11] I0324 17:57:05.664001 1827 hierarchical.cpp:144] Initialized hierarchical allocator process [17:57:05]W: [Step 11/11] I0324 17:57:05.664010 1831 whitelist_watcher.cpp:77] No whitelist given [17:57:05]W: [Step 11/11] I0324 17:57:05.664114 1834 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 1.271881ms [17:57:05]W: 
[Step 11/11] I0324 17:57:05.664136 1834 replica.cpp:320] Persisted replica status to STARTING [17:57:05]W: [Step 11/11] I0324 17:57:05.664353 1833 recover.cpp:473] Replica is in STARTING status [17:57:05]W: [Step 11/11] I0324 17:57:05.665315 1833 replica.cpp:673] Replica in STARTING status received a broadcasted recover request from (1888)@172.30.2.131:51675 [17:57:05]W: [Step 11/11] I0324 17:57:05.665621 1827 recover.cpp:193] Received a recover response from a replica in STARTING status [17:57:05]W: [Step 11/11] I0324 17:57:05.666237 1833 master.cpp:1826] The newly elected leader is master@172.30.2.131:51675 with id 2574ed73-b254-4829-9efc-f76d89150396 [17:57:05]W: [Step
[jira] [Commented] (MESOS-3548) Investigate federations of Mesos masters
[ https://issues.apache.org/jira/browse/MESOS-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210687#comment-15210687 ] John Omernik commented on MESOS-3548: - This is a very interesting topic to me, and one that I would love to help test things on with Mesos as it evolves. > Investigate federations of Mesos masters > > > Key: MESOS-3548 > URL: https://issues.apache.org/jira/browse/MESOS-3548 > Project: Mesos > Issue Type: Improvement >Reporter: Neil Conway > Labels: federation, mesosphere, multi-dc > > In a large Mesos installation, the operator might want to ensure that even if > the Mesos masters are inaccessible or failed, new tasks can still be > scheduled (across multiple different frameworks). HA masters are only a > partial solution here: the masters might still be inaccessible due to a > correlated failure (e.g., Zookeeper misconfiguration/human error). > To support this, we could support the notion of "hierarchies" or > "federations" of Mesos masters. In a Mesos installation with 10k machines, > the operator might configure 10 Mesos masters (each of which might be HA) to > manage 1k machines each. Then an additional "meta-Master" would manage the > allocation of cluster resources to the 10 masters. Hence, the failure of any > individual master would impact 1k machines at most. The meta-master might not > have a lot of work to do: e.g., it might be limited to occasionally > reallocating cluster resources among the 10 masters, or ensuring that newly > added cluster resources are allocated among the masters as appropriate. > Hence, the failure of the meta-master would not prevent any of the individual > masters from scheduling new tasks. A single framework instance probably > wouldn't be able to use more resources than have been assigned to a single > Master, but that seems like a reasonable restriction. 
> This feature might also be a good fit for a multi-datacenter deployment of > Mesos: each Mesos master instance would manage a single DC. Naturally, > reducing the traffic between frameworks and the meta-master would be > important for performance reasons in a configuration like this. > Operationally, this might be simpler if Mesos processes were self-hosting > ([MESOS-3547]). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4909) Introduce kill policy for tasks.
[ https://issues.apache.org/jira/browse/MESOS-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210609#comment-15210609 ] Alexander Rukletsov edited comment on MESOS-4909 at 3/24/16 5:36 PM: - {noformat} Commit: 7ab6a478b9cef548a2470d18bd281aee5610b62a [7ab6a47] Author: Alexander Rukletsov ruklet...@gmail.com Date: 24 Mar 2016 17:30:31 CET Committer: Alexander Rukletsov al...@apache.org Commit Date: 24 Mar 2016 18:21:03 CET Introduced KillPolicy protobuf. Describes a kill policy for a task. Currently does not express different policies (e.g. hitting HTTP endpoints), only controls how long to wait between graceful and forcible task kill. Review: https://reviews.apache.org/r/44656/ {noformat} {noformat} Commit: 1fe6221aa30f35f31378433412d8cb725009bd47 [1fe6221] Author: Alexander Rukletsov ruklet...@gmail.com Date: 24 Mar 2016 17:30:42 CET Committer: Alexander Rukletsov al...@apache.org Commit Date: 24 Mar 2016 18:21:03 CET Added validation for task's kill policy. Review: https://reviews.apache.org/r/44707/ {noformat} {noformat} Commit: d13de4c42b39037c8bd8f79122e7a9ac0d82317f [d13de4c] Author: Alexander Rukletsov ruklet...@gmail.com Date: 24 Mar 2016 17:30:52 CET Committer: Alexander Rukletsov al...@apache.org Commit Date: 24 Mar 2016 18:21:03 CET Used KillPolicy and shutdown grace period in command executor. The command executor determines how much time it allots the underlying task to clean up (effectively how long to wait for the task to comply to SIGTERM before sending SIGKILL) based on both optional task's KillPolicy and optional shutdown_grace_period field in ExecutorInfo. Manual testing was performed to ensure newly introduced protobuf fields are respected. To do that, "mesos-execute" was modified to support KillPolicy and CommandInfo.shell=false. To simulate a task that does not exit in the allotted period, a tiny app (https://github.com/rukletsov/unresponsive-process) that ignores SIGTERM was used. 
More details on testing in the review request. Review: https://reviews.apache.org/r/44657/ {noformat} was (Author: alexr): {noformat} Commit: 7ab6a478b9cef548a2470d18bd281aee5610b62a [7ab6a47] Author: Alexander Rukletsov ruklet...@gmail.com Date: 24 Mar 2016 17:30:31 CET Committer: Alexander Rukletsov al...@apache.org Commit Date: 24 Mar 2016 18:21:03 CET Introduced KillPolicy protobuf. Describes a kill policy for a task. Currently does not express different policies (e.g. hitting HTTP endpoints), only controls how long to wait between graceful and forcible task kill. Review: https://reviews.apache.org/r/44656/ {noformat} {noformat} Commit: 1fe6221aa30f35f31378433412d8cb725009bd47 [1fe6221] Author: Alexander Rukletsov ruklet...@gmail.com Date: 24 Mar 2016 17:30:42 CET Committer: Alexander Rukletsov al...@apache.org Commit Date: 24 Mar 2016 18:21:03 CET Added validation for task's kill policy. Review: https://reviews.apache.org/r/44707/ {noformat} > Introduce kill policy for tasks. > > > Key: MESOS-4909 > URL: https://issues.apache.org/jira/browse/MESOS-4909 > Project: Mesos > Issue Type: Improvement >Reporter: Alexander Rukletsov >Assignee: Alexander Rukletsov > Labels: mesosphere > > A task may require some time to clean up or even a special mechanism to issue > a kill request (currently it's a SIGTERM followed by SIGKILL). Introducing > kill policies per task will help address these issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
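The graceful-then-forcible kill sequence that the KillPolicy grace period controls can be sketched as follows. This is a minimal illustration, not Mesos source: the function name and the inline "unresponsive" child (which ignores SIGTERM, like the test app linked above) are assumptions for the example.

```python
import subprocess, signal, sys, time

def kill_with_grace(proc, grace_period=5.0):
    """Send SIGTERM, wait up to `grace_period` seconds, then SIGKILL.

    Mirrors the graceful-then-forcible kill sequence a task's kill
    policy grace period controls (illustrative sketch, not Mesos code).
    Returns "terminated" if the task complied with SIGTERM, else "killed".
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_period)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored
        proc.wait()
        return "killed"

# A child that ignores SIGTERM, like the unresponsive-process test app.
stubborn = subprocess.Popen(
    [sys.executable, "-c",
     "import signal, time; "
     "signal.signal(signal.SIGTERM, signal.SIG_IGN); time.sleep(60)"])
time.sleep(0.5)  # give the child time to install its handler
result = kill_with_grace(stubborn, grace_period=1.0)
print(result)
```

With a well-behaved task the first branch returns "terminated"; the stubborn child above forces the SIGKILL path.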
[jira] [Updated] (MESOS-2043) framework auth fail with timeout error and never get authenticated
[ https://issues.apache.org/jira/browse/MESOS-2043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-2043: -- Shepherd: Adam B > framework auth fail with timeout error and never get authenticated > -- > > Key: MESOS-2043 > URL: https://issues.apache.org/jira/browse/MESOS-2043 > Project: Mesos > Issue Type: Bug > Components: master, scheduler driver, security, slave >Affects Versions: 0.21.0 >Reporter: Bhuvan Arumugam >Priority: Critical > Labels: mesosphere, security > Fix For: 0.29.0 > > Attachments: aurora-scheduler.20141104-1606-1706.log, master.log, > mesos-master.20141104-1606-1706.log, slave.log > > > I'm facing this issue in master as of > https://github.com/apache/mesos/commit/74ea59e144d131814c66972fb0cc14784d3503d4 > As [~adam-mesos] mentioned in IRC, this sounds similar to MESOS-1866. I'm > running 1 master and 1 scheduler (aurora). The framework authentication fail > due to time out: > error on mesos master: > {code} > I1104 19:37:17.741449 8329 master.cpp:3874] Authenticating > scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083 > I1104 19:37:17.741585 8329 master.cpp:3885] Using default CRAM-MD5 > authenticator > I1104 19:37:17.742106 8336 authenticator.hpp:169] Creating new server SASL > connection > W1104 19:37:22.742959 8329 master.cpp:3953] Authentication timed out > W1104 19:37:22.743548 8329 master.cpp:3930] Failed to authenticate > scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083: > Authentication discarded > {code} > scheduler error: > {code} > I1104 19:38:57.885486 49012 sched.cpp:283] Authenticating with master > master@MASTER_IP:PORT > I1104 19:38:57.885928 49002 authenticatee.hpp:133] Creating new client SASL > connection > I1104 19:38:57.890581 49007 authenticatee.hpp:224] Received SASL > authentication mechanisms: CRAM-MD5 > I1104 19:38:57.890656 49007 authenticatee.hpp:250] Attempting to authenticate > with mechanism 'CRAM-MD5' > W1104 19:39:02.891196 49005 
sched.cpp:378] Authentication timed out > I1104 19:39:02.891850 49018 sched.cpp:338] Failed to authenticate with master > master@MASTER_IP:PORT: Authentication discarded > {code} > Looks like 2 instances {{scheduler-20f88a53-5945-4977-b5af-28f6c52d3c94}} & > {{scheduler-d2d4437b-d375-4467-a583-362152fe065a}} of same framework is > trying to authenticate and fail. > {code} > W1104 19:36:30.769420 8319 master.cpp:3930] Failed to authenticate > scheduler-20f88a53-5945-4977-b5af-28f6c52d3c94@SCHEDULER_IP:8083: Failed to > communicate with authenticatee > I1104 19:36:42.701441 8328 master.cpp:3860] Queuing up authentication > request from scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083 > because authentication is still in progress > {code} > Restarting master and scheduler didn't fix it. > This particular issue happen with 1 master and 1 scheduler after MESOS-1866 > is fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4909) Introduce kill policy for tasks.
[ https://issues.apache.org/jira/browse/MESOS-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210609#comment-15210609 ] Alexander Rukletsov commented on MESOS-4909: {noformat} Commit: 7ab6a478b9cef548a2470d18bd281aee5610b62a [7ab6a47] Author: Alexander Rukletsov ruklet...@gmail.com Date: 24 Mar 2016 17:30:31 CET Committer: Alexander Rukletsov al...@apache.org Commit Date: 24 Mar 2016 18:21:03 CET Introduced KillPolicy protobuf. Describes a kill policy for a task. Currently does not express different policies (e.g. hitting HTTP endpoints), only controls how long to wait between graceful and forcible task kill. Review: https://reviews.apache.org/r/44656/ {noformat} {noformat} Commit: 1fe6221aa30f35f31378433412d8cb725009bd47 [1fe6221] Author: Alexander Rukletsov ruklet...@gmail.com Date: 24 Mar 2016 17:30:42 CET Committer: Alexander Rukletsov al...@apache.org Commit Date: 24 Mar 2016 18:21:03 CET Added validation for task's kill policy. Review: https://reviews.apache.org/r/44707/ {noformat} > Introduce kill policy for tasks. > > > Key: MESOS-4909 > URL: https://issues.apache.org/jira/browse/MESOS-4909 > Project: Mesos > Issue Type: Improvement >Reporter: Alexander Rukletsov >Assignee: Alexander Rukletsov > Labels: mesosphere > > A task may require some time to clean up or even a special mechanism to issue > a kill request (currently it's a SIGTERM followed by SIGKILL). Introducing > kill policies per task will help address these issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4992) sandbox uri does not work outside mesos http server
[ https://issues.apache.org/jira/browse/MESOS-4992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-4992: -- Story Points: 3 > sandbox uri does not work outisde mesos http server > --- > > Key: MESOS-4992 > URL: https://issues.apache.org/jira/browse/MESOS-4992 > Project: Mesos > Issue Type: Bug > Components: webui >Affects Versions: 0.27.1 >Reporter: Stavros Kontopoulos > Labels: mesosphere > Fix For: 0.29.0 > > > The SandBox uri of a framework does not work if i just copy paste it to the > browser. > For example the following sandbox uri: > http://172.17.0.1:5050/#/slaves/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0/frameworks/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-0009/executors/driver-20160321155016-0001/browse > should redirect to: > http://172.17.0.1:5050/#/slaves/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0/browse?path=%2Ftmp%2Fmesos%2Fslaves%2F50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0%2Fframeworks%2F50f87c73-79ef-4f2a-95f0-b2b4062b2de6-0009%2Fexecutors%2Fdriver-20160321155016-0001%2Fruns%2F60533483-31fb-4353-987d-f3393911cc80 > yet it fails with the message: > "Failed to find slaves. > Navigate to the slave's sandbox via the Mesos UI." > and redirects to: > http://172.17.0.1:5050/#/ > It is an issue for me because im working on expanding the mesos spark ui with > sandbox uri, The other option is to get the slave info and parse the json > file there and get executor paths not so straightforward or elegant though. > Moreover i dont see the runs/container_id in the Mesos Proto Api. I guess > this is hidden info, this is the needed piece of info to re-write the uri > without redirection. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
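The redirect target quoted above is just the slave-browse endpoint with the sandbox path percent-encoded into the {{path}} query parameter. A small sketch of constructing such a URL (function and parameter names are illustrative, not web UI source):

```python
from urllib.parse import quote

def browse_url(master, slave_id, sandbox_path):
    """Build a slave-browse URL of the shape shown above by
    percent-encoding the executor's sandbox path ('/' becomes %2F)."""
    return "%s/#/slaves/%s/browse?path=%s" % (
        master, slave_id, quote(sandbox_path, safe=""))

url = browse_url("http://172.17.0.1:5050",
                 "50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0",
                 "/tmp/mesos/slaves")
print(url)
```

Note the reporter's point still stands: building the full path requires the run's container ID, which is not exposed in the proto API.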
[jira] [Updated] (MESOS-4933) Registrar HTTP Authentication.
[ https://issues.apache.org/jira/browse/MESOS-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-4933: -- Story Points: 3 (was: 2) > Registrar HTTP Authentication. > -- > > Key: MESOS-4933 > URL: https://issues.apache.org/jira/browse/MESOS-4933 > Project: Mesos > Issue Type: Task >Reporter: Joerg Schad >Assignee: Jan Schlicht > Labels: authentication, mesosphere, security > > Now that the master (and agents in progress) provide http authentication the > registrar should do the same. > See http://mesos.apache.org/documentation/latest/endpoints/registrar/registry/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2043) framework auth fail with timeout error and never get authenticated
[ https://issues.apache.org/jira/browse/MESOS-2043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-2043: -- Story Points: 5 > framework auth fail with timeout error and never get authenticated > -- > > Key: MESOS-2043 > URL: https://issues.apache.org/jira/browse/MESOS-2043 > Project: Mesos > Issue Type: Bug > Components: master, scheduler driver, security, slave >Affects Versions: 0.21.0 >Reporter: Bhuvan Arumugam >Priority: Critical > Labels: mesosphere, security > Fix For: 0.29.0 > > Attachments: aurora-scheduler.20141104-1606-1706.log, master.log, > mesos-master.20141104-1606-1706.log, slave.log > > > I'm facing this issue in master as of > https://github.com/apache/mesos/commit/74ea59e144d131814c66972fb0cc14784d3503d4 > As [~adam-mesos] mentioned in IRC, this sounds similar to MESOS-1866. I'm > running 1 master and 1 scheduler (aurora). The framework authentication fail > due to time out: > error on mesos master: > {code} > I1104 19:37:17.741449 8329 master.cpp:3874] Authenticating > scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083 > I1104 19:37:17.741585 8329 master.cpp:3885] Using default CRAM-MD5 > authenticator > I1104 19:37:17.742106 8336 authenticator.hpp:169] Creating new server SASL > connection > W1104 19:37:22.742959 8329 master.cpp:3953] Authentication timed out > W1104 19:37:22.743548 8329 master.cpp:3930] Failed to authenticate > scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083: > Authentication discarded > {code} > scheduler error: > {code} > I1104 19:38:57.885486 49012 sched.cpp:283] Authenticating with master > master@MASTER_IP:PORT > I1104 19:38:57.885928 49002 authenticatee.hpp:133] Creating new client SASL > connection > I1104 19:38:57.890581 49007 authenticatee.hpp:224] Received SASL > authentication mechanisms: CRAM-MD5 > I1104 19:38:57.890656 49007 authenticatee.hpp:250] Attempting to authenticate > with mechanism 'CRAM-MD5' > W1104 19:39:02.891196 49005 
sched.cpp:378] Authentication timed out > I1104 19:39:02.891850 49018 sched.cpp:338] Failed to authenticate with master > master@MASTER_IP:PORT: Authentication discarded > {code} > Looks like 2 instances {{scheduler-20f88a53-5945-4977-b5af-28f6c52d3c94}} & > {{scheduler-d2d4437b-d375-4467-a583-362152fe065a}} of same framework is > trying to authenticate and fail. > {code} > W1104 19:36:30.769420 8319 master.cpp:3930] Failed to authenticate > scheduler-20f88a53-5945-4977-b5af-28f6c52d3c94@SCHEDULER_IP:8083: Failed to > communicate with authenticatee > I1104 19:36:42.701441 8328 master.cpp:3860] Queuing up authentication > request from scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083 > because authentication is still in progress > {code} > Restarting master and scheduler didn't fix it. > This particular issue happen with 1 master and 1 scheduler after MESOS-1866 > is fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4933) Registrar HTTP Authentication.
[ https://issues.apache.org/jira/browse/MESOS-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-4933: -- Sprint: Mesosphere Sprint 32 > Registrar HTTP Authentication. > -- > > Key: MESOS-4933 > URL: https://issues.apache.org/jira/browse/MESOS-4933 > Project: Mesos > Issue Type: Task >Reporter: Joerg Schad >Assignee: Jan Schlicht > Labels: authentication, mesosphere, security > > Now that the master (and agents in progress) provide http authentication the > registrar should do the same. > See http://mesos.apache.org/documentation/latest/endpoints/registrar/registry/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4949) Executor shutdown grace period should be configurable.
[ https://issues.apache.org/jira/browse/MESOS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-4949: --- Story Points: 3 (was: 1) > Executor shutdown grace period should be configurable. > -- > > Key: MESOS-4949 > URL: https://issues.apache.org/jira/browse/MESOS-4949 > Project: Mesos > Issue Type: Improvement >Reporter: Alexander Rukletsov >Assignee: Alexander Rukletsov > Labels: mesosphere > Fix For: 0.29.0 > > > Currently, executor shutdown grace period is specified by an agent flag, > which is propagated to executors via the > {{MESOS_EXECUTOR_SHUTDOWN_GRACE_PERIOD}} environment variable. There is no > way to adjust this timeout for the needs of a particular executor. > To tackle this problem, we propose to introduce an optional > {{shutdown_grace_period}} field in {{ExecutorInfo}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
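The lookup order the proposal implies, a per-executor field overriding the agent-provided environment variable, can be sketched as below. This is a hedged illustration of the idea, not executor source; the default value and the "10secs" duration parsing are assumptions.

```python
import os

DEFAULT_GRACE_SECS = 5.0  # illustrative fallback, not the Mesos default

def shutdown_grace_period(executor_grace=None):
    """Resolve the shutdown grace period an executor should honor.

    Preference order (a sketch of the proposal): the optional
    per-executor shutdown_grace_period value, then the agent-provided
    MESOS_EXECUTOR_SHUTDOWN_GRACE_PERIOD environment variable
    (assumed here to look like "10secs"), then a built-in default.
    """
    if executor_grace is not None:
        return float(executor_grace)
    env = os.environ.get("MESOS_EXECUTOR_SHUTDOWN_GRACE_PERIOD")
    if env:
        return float(env.rstrip("secs"))  # "10secs" -> 10.0
    return DEFAULT_GRACE_SECS
```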
[jira] [Updated] (MESOS-2043) framework auth fail with timeout error and never get authenticated
[ https://issues.apache.org/jira/browse/MESOS-2043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-2043: -- Fix Version/s: 0.29.0 > framework auth fail with timeout error and never get authenticated > -- > > Key: MESOS-2043 > URL: https://issues.apache.org/jira/browse/MESOS-2043 > Project: Mesos > Issue Type: Bug > Components: master, scheduler driver, security, slave >Affects Versions: 0.21.0 >Reporter: Bhuvan Arumugam >Priority: Critical > Labels: mesosphere, security > Fix For: 0.29.0 > > Attachments: aurora-scheduler.20141104-1606-1706.log, master.log, > mesos-master.20141104-1606-1706.log, slave.log > > > I'm facing this issue in master as of > https://github.com/apache/mesos/commit/74ea59e144d131814c66972fb0cc14784d3503d4 > As [~adam-mesos] mentioned in IRC, this sounds similar to MESOS-1866. I'm > running 1 master and 1 scheduler (aurora). The framework authentication fail > due to time out: > error on mesos master: > {code} > I1104 19:37:17.741449 8329 master.cpp:3874] Authenticating > scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083 > I1104 19:37:17.741585 8329 master.cpp:3885] Using default CRAM-MD5 > authenticator > I1104 19:37:17.742106 8336 authenticator.hpp:169] Creating new server SASL > connection > W1104 19:37:22.742959 8329 master.cpp:3953] Authentication timed out > W1104 19:37:22.743548 8329 master.cpp:3930] Failed to authenticate > scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083: > Authentication discarded > {code} > scheduler error: > {code} > I1104 19:38:57.885486 49012 sched.cpp:283] Authenticating with master > master@MASTER_IP:PORT > I1104 19:38:57.885928 49002 authenticatee.hpp:133] Creating new client SASL > connection > I1104 19:38:57.890581 49007 authenticatee.hpp:224] Received SASL > authentication mechanisms: CRAM-MD5 > I1104 19:38:57.890656 49007 authenticatee.hpp:250] Attempting to authenticate > with mechanism 'CRAM-MD5' > W1104 19:39:02.891196 49005 
sched.cpp:378] Authentication timed out > I1104 19:39:02.891850 49018 sched.cpp:338] Failed to authenticate with master > master@MASTER_IP:PORT: Authentication discarded > {code} > Looks like 2 instances {{scheduler-20f88a53-5945-4977-b5af-28f6c52d3c94}} & > {{scheduler-d2d4437b-d375-4467-a583-362152fe065a}} of same framework is > trying to authenticate and fail. > {code} > W1104 19:36:30.769420 8319 master.cpp:3930] Failed to authenticate > scheduler-20f88a53-5945-4977-b5af-28f6c52d3c94@SCHEDULER_IP:8083: Failed to > communicate with authenticatee > I1104 19:36:42.701441 8328 master.cpp:3860] Queuing up authentication > request from scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083 > because authentication is still in progress > {code} > Restarting master and scheduler didn't fix it. > This particular issue happen with 1 master and 1 scheduler after MESOS-1866 > is fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4992) sandbox uri does not work outside mesos http server
[ https://issues.apache.org/jira/browse/MESOS-4992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-4992: -- Fix Version/s: 0.29.0 > sandbox uri does not work outisde mesos http server > --- > > Key: MESOS-4992 > URL: https://issues.apache.org/jira/browse/MESOS-4992 > Project: Mesos > Issue Type: Bug > Components: webui >Affects Versions: 0.27.1 >Reporter: Stavros Kontopoulos > Labels: mesosphere > Fix For: 0.29.0 > > > The SandBox uri of a framework does not work if i just copy paste it to the > browser. > For example the following sandbox uri: > http://172.17.0.1:5050/#/slaves/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0/frameworks/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-0009/executors/driver-20160321155016-0001/browse > should redirect to: > http://172.17.0.1:5050/#/slaves/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0/browse?path=%2Ftmp%2Fmesos%2Fslaves%2F50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0%2Fframeworks%2F50f87c73-79ef-4f2a-95f0-b2b4062b2de6-0009%2Fexecutors%2Fdriver-20160321155016-0001%2Fruns%2F60533483-31fb-4353-987d-f3393911cc80 > yet it fails with the message: > "Failed to find slaves. > Navigate to the slave's sandbox via the Mesos UI." > and redirects to: > http://172.17.0.1:5050/#/ > It is an issue for me because im working on expanding the mesos spark ui with > sandbox uri, The other option is to get the slave info and parse the json > file there and get executor paths not so straightforward or elegant though. > Moreover i dont see the runs/container_id in the Mesos Proto Api. I guess > this is hidden info, this is the needed piece of info to re-write the uri > without redirection. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3003) Support mounting in default configuration files/volumes into every new container
[ https://issues.apache.org/jira/browse/MESOS-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210523#comment-15210523 ] Jie Yu commented on MESOS-3003: --- I'm closing this ticket because all the network related /etc/* files should be handled by the network isolator because it (and only it) knows about the ip, hostname information. Mounting in host /etc/* files blindly does not make sense. > Support mounting in default configuration files/volumes into every new > container > > > Key: MESOS-3003 > URL: https://issues.apache.org/jira/browse/MESOS-3003 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: Timothy Chen >Assignee: Gilbert Song > Labels: mesosphere, unified-containerizer-mvp > > Most container images leave out system configuration (e.g: /etc/*) and expect > the container runtimes to mount in specific configurations as needed such as > /etc/resolv.conf from the host into the container when needed. > We need to support mounting in specific configuration files for command > executor to work, and also allow the user to optionally define other > configuration files to mount in as well via flags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5024) local docker puller uses colon in tarball names
James Peach created MESOS-5024: -- Summary: local docker puller uses colon in tarball names Key: MESOS-5024 URL: https://issues.apache.org/jira/browse/MESOS-5024 Project: Mesos Issue Type: Task Components: containerization Reporter: James Peach Priority: Trivial The local docker puller for the unified containerizer expects tagged docker repository images to be named {{repository:tag.tar}}. However, tar(1) would normally interpret that as a remote archive: {quote} -f, --file=ARCHIVE ... An archive name that has a colon in it specifies a file or device on a remote machine. The part before the colon is taken as the machine name or IP address, and the part after it as the file or device pathname ... {quote} This works correctly only because the puller always passes an absolute path to tar(1), which causes it to interpret the name as a local archive again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
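The two ways to defeat tar's remote-archive heuristic, an absolute path (which the puller happens to rely on) or GNU tar's {{--force-local}} flag, can be sketched as a command builder. The flag is a real GNU tar option; the helper itself is illustrative, not the Mesos puller.

```python
import os

def tar_extract_command(archive, dest):
    """Build a tar invocation that is safe for archives whose names
    contain a colon (e.g. a tagged image saved as `busybox:latest.tar`).

    GNU tar treats `host:file` as a remote archive, so we pass an
    absolute path (a leading '/' defeats the heuristic) and add
    --force-local, which tells GNU tar the archive is local even if
    its name contains a colon.
    """
    return ["tar", "--force-local", "-x",
            "-f", os.path.abspath(archive), "-C", dest]

cmd = tar_extract_command("busybox:latest.tar", "/tmp/rootfs")
print(cmd)
```

Relying on {{--force-local}} rather than the incidental absolute path would make the behavior explicit (though note bsdtar does not support the flag).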
[jira] [Commented] (MESOS-4694) DRFAllocator takes very long to allocate resources with a large number of frameworks
[ https://issues.apache.org/jira/browse/MESOS-4694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210399#comment-15210399 ] Joris Van Remoortere commented on MESOS-4694: - {code} commit 6a8738f89b01ac3ddd70c418c49f350e17fa Author: Dario RexinDate: Thu Mar 24 14:10:31 2016 +0100 Allocator Performance: Exited early to avoid needless computation. Review: https://reviews.apache.org/r/43668/ {code} > DRFAllocator takes very long to allocate resources with a large number of > frameworks > > > Key: MESOS-4694 > URL: https://issues.apache.org/jira/browse/MESOS-4694 > Project: Mesos > Issue Type: Improvement > Components: allocation >Affects Versions: 0.26.0, 0.27.0, 0.27.1, 0.28.0, 0.27.2, 0.28.1 >Reporter: Dario Rexin >Assignee: Dario Rexin > > With a growing number of connected frameworks, the allocation time grows to > very high numbers. The addition of quota in 0.27 had an additional impact on > these numbers. Running `mesos-tests.sh --benchmark > --gtest_filter=HierarchicalAllocator_BENCHMARK_Test.DeclineOffers` gives us > the following numbers: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 200 frameworks > round 0 allocate took 2.921202secs to make 200 offers > round 1 allocate took 2.85045secs to make 200 offers > round 2 allocate took 2.823768secs to make 200 offers > {noformat} > Increasing the number of frameworks to 2000: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. 
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 2000 frameworks > round 0 allocate took 28.209454secs to make 2000 offers > round 1 allocate took 28.469419secs to make 2000 offers > round 2 allocate took 28.138086secs to make 2000 offers > {noformat} > I was able to reduce this time by a substantial amount. After applying the > patches: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 200 frameworks > round 0 allocate took 1.016226secs to make 2000 offers > round 1 allocate took 1.102729secs to make 2000 offers > round 2 allocate took 1.102624secs to make 2000 offers > {noformat} > And with 2000 frameworks: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 2000 frameworks > round 0 allocate took 12.563203secs to make 2000 offers > round 1 allocate took 12.437517secs to make 2000 offers > round 2 allocate took 12.470708secs to make 2000 offers > {noformat} > The patches do 3 things to improve the performance of the allocator. 
> 1) The total values in the DRFSorter will be pre calculated per resource type > 2) In the allocate method, when no resources are available to allocate, we > break out of the innermost loop to prevent looping over a large number of > frameworks when we have nothing to allocate > 3) when a framework suppresses offers, we remove it from the sorter instead > of just calling continue in the allocation loop - this greatly improves > performance in the sorter and prevents looping over frameworks that don't > need resources > Assuming that most of the frameworks behave nicely and suppress offers when > they have nothing to schedule, it is fair to assume, that point 3) has the > biggest impact on the performance. If we suppress offers for 90% of the > frameworks in the benchmark test, we see following numbers: > {noformat} > ==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 200 slaves and 2000 frameworks > round 0 allocate took 11626us to make 200 offers > round 1 allocate took 22890us to make 200 offers > round 2 allocate took 21346us to make 200 offers > {noformat} > And for 200 frameworks: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and
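Two of the three optimizations above, breaking out of the inner loop when an agent has nothing left, and dropping suppressed frameworks from the sorter instead of skipping them each pass, can be sketched with a toy allocation loop. All names and the scalar "resources" model are illustrative simplifications, not Mesos source.

```python
def allocate(agents, frameworks, sorter):
    """Toy allocation loop illustrating two of the optimizations:

    - frameworks that suppressed offers are removed from the sorted
      order up front instead of hit with `continue` on every pass;
    - the inner loop exits early once an agent has nothing to offer.
    """
    active = [f for f in sorter if not frameworks[f]["suppressed"]]
    offers = []
    for agent in agents:
        available = agent["resources"]
        for f in active:
            if available <= 0:
                break  # early exit: nothing left on this agent
            grant = min(frameworks[f]["demand"], available)
            if grant > 0:
                offers.append((f, agent["id"], grant))
                available -= grant
        agent["resources"] = available
    return offers

offers = allocate(
    agents=[{"id": "a1", "resources": 4}],
    frameworks={
        "f1": {"suppressed": False, "demand": 3},
        "f2": {"suppressed": True, "demand": 5},
        "f3": {"suppressed": False, "demand": 2},
    },
    sorter=["f1", "f2", "f3"])
print(offers)
```

The suppressed framework f2 never appears in the loop at all, which is why point 3) dominates the benchmark numbers when most frameworks suppress.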
[jira] [Updated] (MESOS-4694) DRFAllocator takes very long to allocate resources with a large number of frameworks
[ https://issues.apache.org/jira/browse/MESOS-4694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-4694: Affects Version/s: 0.28.1 0.28.0 0.27.2 > DRFAllocator takes very long to allocate resources with a large number of > frameworks > > > Key: MESOS-4694 > URL: https://issues.apache.org/jira/browse/MESOS-4694 > Project: Mesos > Issue Type: Improvement > Components: allocation >Affects Versions: 0.26.0, 0.27.0, 0.27.1, 0.28.0, 0.27.2, 0.28.1 >Reporter: Dario Rexin >Assignee: Dario Rexin > > With a growing number of connected frameworks, the allocation time grows to > very high numbers. The addition of quota in 0.27 had an additional impact on > these numbers. Running `mesos-tests.sh --benchmark > --gtest_filter=HierarchicalAllocator_BENCHMARK_Test.DeclineOffers` gives us > the following numbers: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 200 frameworks > round 0 allocate took 2.921202secs to make 200 offers > round 1 allocate took 2.85045secs to make 200 offers > round 2 allocate took 2.823768secs to make 200 offers > {noformat} > Increasing the number of frameworks to 2000: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 2000 frameworks > round 0 allocate took 28.209454secs to make 2000 offers > round 1 allocate took 28.469419secs to make 2000 offers > round 2 allocate took 28.138086secs to make 2000 offers > {noformat} > I was able to reduce this time by a substantial amount. After applying the > patches: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. 
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 200 frameworks > round 0 allocate took 1.016226secs to make 2000 offers > round 1 allocate took 1.102729secs to make 2000 offers > round 2 allocate took 1.102624secs to make 2000 offers > {noformat} > And with 2000 frameworks: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 2000 frameworks > round 0 allocate took 12.563203secs to make 2000 offers > round 1 allocate took 12.437517secs to make 2000 offers > round 2 allocate took 12.470708secs to make 2000 offers > {noformat} > The patches do 3 things to improve the performance of the allocator. > 1) The total values in the DRFSorter will be pre calculated per resource type > 2) In the allocate method, when no resources are available to allocate, we > break out of the innermost loop to prevent looping over a large number of > frameworks when we have nothing to allocate > 3) when a framework suppresses offers, we remove it from the sorter instead > of just calling continue in the allocation loop - this greatly improves > performance in the sorter and prevents looping over frameworks that don't > need resources > Assuming that most of the frameworks behave nicely and suppress offers when > they have nothing to schedule, it is fair to assume, that point 3) has the > biggest impact on the performance. If we suppress offers for 90% of the > frameworks in the benchmark test, we see following numbers: > {noformat} > ==] Running 1 test from 1 test case. > [--] Global test environment set-up. 
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 200 slaves and 2000 frameworks > round 0 allocate took 11626us to make 200 offers > round 1 allocate took 22890us to make 200 offers > round 2 allocate took 21346us to make 200 offers > {noformat} > And for 200 frameworks: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 2000 frameworks > round 0 allocate took 1.11178secs to make 2000 offers > round 1 allocate took 1.062649secs to make 2000 offers > round 2 allocate took 1.080181secs to make 2000 offers > {noformat} > Review requests: >
[jira] [Created] (MESOS-5023) MesosContainerizerProvisionerTest.ProvisionFailed is flaky.
Alexander Rukletsov created MESOS-5023: -- Summary: MesosContainerizerProvisionerTest.ProvisionFailed is flaky. Key: MESOS-5023 URL: https://issues.apache.org/jira/browse/MESOS-5023 Project: Mesos Issue Type: Bug Reporter: Alexander Rukletsov Observed on the Apache Jenkins. {noformat} [ RUN ] MesosContainerizerProvisionerTest.ProvisionFailed I0324 13:38:56.284261 2948 containerizer.cpp:666] Starting container 'test_container' for executor 'executor' of framework '' I0324 13:38:56.285825 2939 containerizer.cpp:1421] Destroying container 'test_container' I0324 13:38:56.285854 2939 containerizer.cpp:1424] Waiting for the provisioner to complete for container 'test_container' [ OK ] MesosContainerizerProvisionerTest.ProvisionFailed (7 ms) [ RUN ] MesosContainerizerProvisionerTest.DestroyWhileProvisioning I0324 13:38:56.291187 2944 containerizer.cpp:666] Starting container 'c2316963-c6cb-4c7f-a3b9-17ca5931e5b2' for executor 'executor' of framework '' I0324 13:38:56.292157 2944 containerizer.cpp:1421] Destroying container 'c2316963-c6cb-4c7f-a3b9-17ca5931e5b2' I0324 13:38:56.292179 2944 containerizer.cpp:1424] Waiting for the provisioner to complete for container 'c2316963-c6cb-4c7f-a3b9-17ca5931e5b2' F0324 13:38:56.292899 2944 containerizer.cpp:752] Check failed: containers_.contains(containerId) *** Check failure stack trace: *** @ 0x2ac9973d0ae4 google::LogMessage::Fail() @ 0x2ac9973d0a30 google::LogMessage::SendToLog() @ 0x2ac9973d0432 google::LogMessage::Flush() @ 0x2ac9973d3346 google::LogMessageFatal::~LogMessageFatal() @ 0x2ac996af897c mesos::internal::slave::MesosContainerizerProcess::_launch() @ 0x2ac996b1f18a 
_ZZN7process8dispatchIbN5mesos8internal5slave25MesosContainerizerProcessERKNS1_11ContainerIDERK6OptionINS1_8TaskInfoEERKNS1_12ExecutorInfoERKSsRKS8_ISsERKNS1_7SlaveIDERKNS_3PIDINS3_5SlaveEEEbRKS8_INS3_13ProvisionInfoEES5_SA_SD_SsSI_SL_SQ_bSU_EENS_6FutureIT_EERKNSO_IT0_EEMS10_FSZ_T1_T2_T3_T4_T5_T6_T7_T8_T9_ET10_T11_T12_T13_T14_T15_T16_T17_T18_ENKUlPNS_11ProcessBaseEE_clES1P_ @ 0x2ac996b479d9 _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIbN5mesos8internal5slave25MesosContainerizerProcessERKNS5_11ContainerIDERK6OptionINS5_8TaskInfoEERKNS5_12ExecutorInfoERKSsRKSC_ISsERKNS5_7SlaveIDERKNS0_3PIDINS7_5SlaveEEEbRKSC_INS7_13ProvisionInfoEES9_SE_SH_SsSM_SP_SU_bSY_EENS0_6FutureIT_EERKNSS_IT0_EEMS14_FS13_T1_T2_T3_T4_T5_T6_T7_T8_T9_ET10_T11_T12_T13_T14_T15_T16_T17_T18_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ @ 0x2ac997334fef std::function<>::operator()() @ 0x2ac99731b1c7 process::ProcessBase::visit() @ 0x2ac997321154 process::DispatchEvent::visit() @ 0x9a699c process::ProcessBase::serve() @ 0x2ac9973173c0 process::ProcessManager::resume() @ 0x2ac99731445a _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ @ 0x2ac997320916 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE @ 0x2ac9973208c6 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ @ 0x2ac997320858 _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE @ 0x2ac9973207af _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv @ 0x2ac997320748 _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv @ 0x2ac9989aea60 
(unknown) @ 0x2ac999125182 start_thread @ 0x2ac99943547d (unknown) make[4]: Leaving directory `/mesos/mesos-0.29.0/_build/src' make[4]: *** [check-local] Aborted make[3]: *** [check-am] Error 2 make[3]: Leaving directory `/mesos/mesos-0.29.0/_build/src' make[2]: *** [check] Error 2 make[2]: Leaving directory `/mesos/mesos-0.29.0/_build/src' make[1]: *** [check-recursive] Error 1 make[1]: Leaving directory `/mesos/mesos-0.29.0/_build' make: *** [distcheck] Error 1 Build step 'Execute shell' marked build as failure {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4933) Registrar HTTP Authentication.
[ https://issues.apache.org/jira/browse/MESOS-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210388#comment-15210388 ] Jan Schlicht edited comment on MESOS-4933 at 3/24/16 3:17 PM: -- Yes, the approach will be similar to MESOS-4956. was (Author: nfnt): Yes, my approach will be similar to MESOS-4956. > Registrar HTTP Authentication. > -- > > Key: MESOS-4933 > URL: https://issues.apache.org/jira/browse/MESOS-4933 > Project: Mesos > Issue Type: Task >Reporter: Joerg Schad >Assignee: Jan Schlicht > Labels: authentication, mesosphere, security > > Now that the master (and agents in progress) provide http authentication the > registrar should do the same. > See http://mesos.apache.org/documentation/latest/endpoints/registrar/registry/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1420) Shorten slave, framework and run IDs
[ https://issues.apache.org/jira/browse/MESOS-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-1420: -- Shepherd: (was: Adam B) > Shorten slave, framework and run IDs > > > Key: MESOS-1420 > URL: https://issues.apache.org/jira/browse/MESOS-1420 > Project: Mesos > Issue Type: Bug >Reporter: Robert Lacroix >Priority: Minor > > Slave, framework and run IDs are currently quite long and therefore clutter > paths to sandboxes. Typically a sandbox path looks like this: > {code} > /tmp/mesos/slaves/2014-05-23-16:21:05-16777343-5050-3204-0/frameworks/2014-05-23-16:21:05-16777343-5050-3204-/executors//runs/c22e4dc3-95e5-49fc-9793-b0d22a4f244c > {code} > I'd propose shorter and uniform IDs for slaves, frameworks and runs (and > probably everywhere else where we use IDs) that look like this: > {code} > [a-z0-9]{13} > {code} > This has about 65bit keyspace compared to 128bit of a UUID, but I think it > should be random enough. > With that the path would be roughly 80 chars shorter (179 vs 99) and a lot > more readable: > {code} > /tmp/mesos/slaves/i0b195fb1j14n/frameworks/gtq5kgba60ll4/executors//runs/7qisjqsb581io > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
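The proposed ID format can be sketched in a few lines of Python. This is a hypothetical illustration of the `[a-z0-9]{13}` scheme, not Mesos code; the `short_id` name is my own:

```python
import secrets
import string

# The proposed alphabet: [a-z0-9], 36 symbols.
ALPHABET = string.ascii_lowercase + string.digits

def short_id(length=13):
    """Return a random ID matching [a-z0-9]{13}.

    36**13 is roughly 2**67, i.e. on the order of the ~65-bit
    keyspace the proposal cites, versus the 128 bits of a UUID.
    """
    return ''.join(secrets.choice(ALPHABET) for _ in range(length))

print(short_id())  # e.g. something like 'i0b195fb1j14n'
```

Using `secrets` rather than `random` keeps the IDs unpredictable, which matters if sandbox paths are ever guessable by other users on the host.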
[jira] [Updated] (MESOS-4308) Reliably report executor terminations to framework schedulers.
[ https://issues.apache.org/jira/browse/MESOS-4308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-4308: -- Shepherd: (was: Adam B) > Reliably report executor terminations to framework schedulers. > -- > > Key: MESOS-4308 > URL: https://issues.apache.org/jira/browse/MESOS-4308 > Project: Mesos > Issue Type: Improvement >Reporter: Charles Reiss > Labels: mesosphere > > Now that executor terminations are reported (unreliably), we should > investigate queuing up these messages (on the agent?) and resending them > periodically until we get an acknowledgement, much like status updates do. > From MESOS-313: The Scheduler interface has a callback for executorLost, but > currently it is never called. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3656) Port process/socket.hpp to Windows
[ https://issues.apache.org/jira/browse/MESOS-3656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210024#comment-15210024 ] Joris Van Remoortere commented on MESOS-3656: - {code} commit 4e19c3e6f09eaa2793f4717e414429e0e6335e0f Author: Daniel Pravat Date: Thu Mar 24 09:33:05 2016 +0100 Windows: [2/2] Lifted socket API into Stout. Review: https://reviews.apache.org/r/44139/ commit 6f8544cf5e2748a58ac979e6d12336b2dccbf1fb Author: Daniel Pravat Date: Thu Mar 24 09:32:57 2016 +0100 Windows: [1/2] Lifted socket API into Stout. Review: https://reviews.apache.org/r/44138/ {code} > Port process/socket.hpp to Windows > -- > > Key: MESOS-3656 > URL: https://issues.apache.org/jira/browse/MESOS-3656 > Project: Mesos > Issue Type: Task > Components: libprocess >Reporter: Alex Clemmer >Assignee: Alex Clemmer > Labels: mesosphere, windows > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4431) Sharing of persistent volumes via reference counting
[ https://issues.apache.org/jira/browse/MESOS-4431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-4431: -- Shepherd: (was: Adam B) > Sharing of persistent volumes via reference counting > > > Key: MESOS-4431 > URL: https://issues.apache.org/jira/browse/MESOS-4431 > Project: Mesos > Issue Type: Improvement > Components: general >Affects Versions: 0.25.0 >Reporter: Anindya Sinha >Assignee: Anindya Sinha > Labels: external-volumes, persistent-volumes > > Add capability for specific resources to be shared amongst tasks within or > across frameworks/roles. Enable this functionality for persistent volumes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4893) Allow setting permissions and access control on persistent volumes
[ https://issues.apache.org/jira/browse/MESOS-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209993#comment-15209993 ] Deshi Xiao commented on MESOS-4893: --- I have read the design doc. I would prefer adding backward-compatibility support. > Allow setting permissions and access control on persistent volumes > -- > > Key: MESOS-4893 > URL: https://issues.apache.org/jira/browse/MESOS-4893 > Project: Mesos > Issue Type: Improvement > Components: general >Reporter: Anindya Sinha >Assignee: Anindya Sinha > Labels: external-volumes, persistent-volumes > > Currently, persistent volumes are exclusive, i.e. if a persistent volume > is used by one task or executor, it cannot be concurrently used by another task > or executor. > With the introduction of shared volumes, persistent volumes can be used > simultaneously by multiple tasks or executors. As a result, we need to > introduce setting up of ownership of persistent volumes at creation of > volumes, which the tasks need to follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4760) Expose metrics and gauges for fetcher cache usage and hit rate
[ https://issues.apache.org/jira/browse/MESOS-4760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209957#comment-15209957 ] Deshi Xiao commented on MESOS-4760: --- +1 > Expose metrics and gauges for fetcher cache usage and hit rate > -- > > Key: MESOS-4760 > URL: https://issues.apache.org/jira/browse/MESOS-4760 > Project: Mesos > Issue Type: Improvement > Components: fetcher, statistics >Reporter: Michael Browning >Priority: Minor > Labels: features, fetcher, statistics > > To evaluate the fetcher cache and calibrate the value of the > fetcher_cache_size flag, it would be useful to have metrics and gauges on > agents that expose operational statistics like cache hit rate, occupied cache > size, and time spent downloading resources that were not present. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4812) Mesos fails to escape command health checks
[ https://issues.apache.org/jira/browse/MESOS-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-4812: Assignee: (was: Benjamin Bannier) > Mesos fails to escape command health checks > --- > > Key: MESOS-4812 > URL: https://issues.apache.org/jira/browse/MESOS-4812 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.25.0 >Reporter: Lukas Loesche > > As described in https://github.com/mesosphere/marathon/issues/ > I would like to run a command health check > {noformat} > /bin/bash -c " {noformat} > The health check fails because Mesos, while running the command inside the double > quotes of a sh -c "", doesn't escape the double quotes in the command. > If I escape the double quotes myself, the command health check succeeds. But > this would mean that the user needs intimate knowledge of how Mesos executes > his commands, which can't be right. > I was told this is not a Marathon but a Mesos issue, so I am opening this JIRA. > I don't know if this only affects the command health check. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
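The quoting failure described in the report can be reproduced outside Mesos. Below is a minimal sketch, using a hypothetical `echo` command rather than the actual Mesos health-check code path, showing how unescaped inner double quotes terminate the outer `sh -c "..."` string early:

```python
import subprocess

cmd = 'echo "a b"'  # a command that itself contains double quotes

# Interpolated verbatim into sh -c "...", the inner quotes end the
# outer string early and the command degenerates to `echo a`:
naive = subprocess.run('sh -c "%s"' % cmd, shell=True,
                       capture_output=True, text=True)
print(naive.stdout.strip())    # a

# Escaping the inner quotes by hand (what the reporter had to do)
# preserves the intended command:
escaped = subprocess.run('sh -c "%s"' % cmd.replace('"', r'\"'),
                         shell=True, capture_output=True, text=True)
print(escaped.stdout.strip())  # a b
```

The general fix on the caller's side is to avoid string interpolation entirely and pass the command as a single argument (e.g. `['sh', '-c', cmd]`), which is what escaping by hand approximates.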
[jira] [Created] (MESOS-5022) Provide LDAP as default authorisation
Klaus Ma created MESOS-5022: --- Summary: Provide LDAP as default authorisation Key: MESOS-5022 URL: https://issues.apache.org/jira/browse/MESOS-5022 Project: Mesos Issue Type: Epic Reporter: Klaus Ma Assignee: Klaus Ma The default authorisation/ACL is configured by a {{json}} file; the operator has to restart the master whenever a new user is added. It would be better to provide LDAP as the default ACL: 1. Provide a real example of the ACL interface 2. Provide a default auth plugin for users to use -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4984) MasterTest.SlavesEndpointTwoSlaves is flaky
[ https://issues.apache.org/jira/browse/MESOS-4984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-4984: -- Sprint: Mesosphere Sprint 31 Story Points: 2 > MasterTest.SlavesEndpointTwoSlaves is flaky > --- > > Key: MESOS-4984 > URL: https://issues.apache.org/jira/browse/MESOS-4984 > Project: Mesos > Issue Type: Bug > Components: tests >Reporter: Neil Conway >Assignee: Anand Mazumdar > Labels: flaky-test, mesosphere, tech-debt > Fix For: 0.29.0 > > Attachments: slaves_endpoint_flaky_4984_verbose_log.txt > > > Observed on Arch Linux with GCC 6, running in a virtualbox VM: > [ RUN ] MasterTest.SlavesEndpointTwoSlaves > /mesos-2/src/tests/master_tests.cpp:1710: Failure > Value of: array.get().values.size() > Actual: 1 > Expected: 2u > Which is: 2 > [ FAILED ] MasterTest.SlavesEndpointTwoSlaves (86 ms) > Seems to fail non-deterministically, perhaps more often when there is > concurrent CPU load on the machine. -- This message was sent by Atlassian JIRA (v6.3.4#6332)