[jira] [Assigned] (MESOS-1816) lxc execution driver support for docker containerizer

2014-09-18 Thread Timothy Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Chen reassigned MESOS-1816:
---

Assignee: Timothy Chen

> lxc execution driver support for docker containerizer
> -
>
> Key: MESOS-1816
> URL: https://issues.apache.org/jira/browse/MESOS-1816
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Affects Versions: 0.20.1
>Reporter: Eugen Feller
>Assignee: Timothy Chen
>  Labels: docker
>
> Hi all,
> One way to get networking up and running in Docker is to use the bridge mode. 
> The bridge mode results in Docker automatically assigning IPs to the 
> containers from the IP range specified on the docker0 bridge.
> In our setup we need to manage IPs using our own DHCP server. Unfortunately 
> this is not supported by Docker's libcontainer execution driver. Instead, the 
> lxc execution driver 
> (http://blog.docker.com/2014/03/docker-0-9-introducing-execution-drivers-and-libcontainer/)
>  can be used. In order to use the lxc execution driver, the Docker daemon 
> needs to be started with the "-e lxc" flag. Once started, Docker's own 
> networking can be disabled and lxc options can be passed to the docker run 
> command. For example:
> $ docker run -n=false --lxc-conf="lxc.network.type = veth" 
> --lxc-conf="lxc.network.link = br0" --lxc-conf="lxc.network.name = eth0" 
> --lxc-conf="lxc.network.flags = up" ...
> This will force Docker to use my own bridge br0. Moreover, an IP can be 
> assigned to the eth0 interface by executing the "dhclient eth0" command 
> inside the started container.
> In the previous integration of Docker in Mesos (using Deimos), I have passed 
> the aforementioned options using the "options" flag in Marathon. However, 
> with the new changes this is no longer possible. It would be great to support 
> the lxc execution driver in the current Docker integration.
> Thanks.
> Best regards,
> Eugen
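
A minimal sketch of what forwarding such lxc options to the docker run
invocation could look like on the containerizer side. This is not existing
Mesos code; the extraOptions input is a hypothetical stand-in for however the
options would be surfaced (for example a slave flag or a container-info field):

{code}
#include <iostream>
#include <string>
#include <vector>

// Build the argv for a `docker run` that disables Docker's own networking
// and forwards user-supplied lxc.network.* settings via --lxc-conf.
std::vector<std::string> buildDockerRun(
    const std::string& image,
    const std::vector<std::string>& extraOptions)
{
  std::vector<std::string> argv = {"docker", "run", "-n=false"};
  for (const std::string& option : extraOptions) {
    argv.push_back("--lxc-conf=" + option);
  }
  argv.push_back(image);
  return argv;
}

int main()
{
  const std::vector<std::string> argv = buildDockerRun(
      "busybox",  // example image
      {"lxc.network.type = veth",
       "lxc.network.link = br0",
       "lxc.network.name = eth0",
       "lxc.network.flags = up"});

  for (const std::string& arg : argv) {
    std::cout << arg << ' ';
  }
  std::cout << std::endl;
  return 0;
}
{code}

Note that the Docker daemon itself still has to run with the lxc execution
driver ("-e lxc") for these flags to take effect, as described above.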



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1818) AllocatorTest/0.ResourcesUnused sometimes segfaults

2014-09-18 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1818:
---
Sprint: Mesos Q3 Sprint 5

> AllocatorTest/0.ResourcesUnused sometimes segfaults
> ---
>
> Key: MESOS-1818
> URL: https://issues.apache.org/jira/browse/MESOS-1818
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.21.0
>Reporter: Vinod Kone
>Assignee: Benjamin Mahler
>Priority: Critical
>
> {code}
> [ RUN  ] AllocatorTest/0.ResourcesUnused
> *** Aborted at 1411088950 (unix time) try "date -d @1411088950" if you are 
> using GNU date ***
> PC: @   0x8649a4 mesos::SlaveID::value()
> *** SIGSEGV (@0x2de9) received by PID 20876 (TID 0x7fb63a1c0940) from PID 
> 11753; stack trace: ***
> @ 0x7fb643ec4ca0 (unknown)
> @   0x8649a4 mesos::SlaveID::value()
> @   0x8741c7 mesos::hash_value()
> @   0x8f7448 boost::hash<>::operator()()
> @   0x8e0bed 
> boost::unordered::detail::mix64_policy<>::apply_hash<>()
> @ 0x7fb64694c1cf boost::unordered::detail::table<>::hash()
> @ 0x7fb646973615 boost::unordered::detail::table<>::find_node()
> @ 0x7fb64694c191 boost::unordered::detail::table_impl<>::count()
> @ 0x7fb64691f3c1 boost::unordered::unordered_map<>::count()
> @ 0x7fb6468f4373 hashmap<>::contains()
> @ 0x7fb6468c5eda mesos::internal::master::Master::getSlave()
> @ 0x7fb6468c0fc3 mesos::internal::master::Master::removeFramework()
> @ 0x7fb6468afa9f 
> mesos::internal::master::Master::unregisterFramework()
> @ 0x7fb646904ab9 ProtobufProcess<>::handler1<>()
> @ 0x7fb6469a1e81 
> _ZNSt5_BindIFPFvPN5mesos8internal6master6MasterEMS3_FvRKN7process4UPIDERKNS0_11FrameworkIDEEMNS1_26UnregisterFrameworkMessageEKFSB_vES8_RKSsES4_SD_SG_St12_PlaceholderILi1EESL_ILi26__callIvJS8_SI_EJLm0ELm1ELm2ELm3ELm4T_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
> @ 0x7fb646983afe std::_Bind<>::operator()<>()
> @ 0x7fb64695f83c std::_Function_handler<>::_M_invoke()
> @   0xc4e17f std::function<>::operator()()
> @ 0x7fb6468ebd10 ProtobufProcess<>::visit()
> @ 0x7fb6468a9892 mesos::internal::master::Master::_visit()
> @ 0x7fb6468a8f46 mesos::internal::master::Master::visit()
> @ 0x7fb6468ce670 process::MessageEvent::visit()
> @   0x86ad54 process::ProcessBase::serve()
> @ 0x7fb6470e9738 process::ProcessManager::resume()
> @ 0x7fb6470dff3f process::schedule()
> @ 0x7fb643ebc83d start_thread
> @ 0x7fb642c2426d clone
> make[3]: *** [check-local] Segmentation fault
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-1818) AllocatorTest/0.ResourcesUnused sometimes segfaults

2014-09-18 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-1818:
-

Assignee: Vinod Kone

> AllocatorTest/0.ResourcesUnused sometimes segfaults
> ---
>
> Key: MESOS-1818
> URL: https://issues.apache.org/jira/browse/MESOS-1818
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.21.0
>Reporter: Vinod Kone
>Assignee: Vinod Kone
>Priority: Critical
>
> {code}
> [ RUN  ] AllocatorTest/0.ResourcesUnused
> *** Aborted at 1411088950 (unix time) try "date -d @1411088950" if you are 
> using GNU date ***
> PC: @   0x8649a4 mesos::SlaveID::value()
> *** SIGSEGV (@0x2de9) received by PID 20876 (TID 0x7fb63a1c0940) from PID 
> 11753; stack trace: ***
> @ 0x7fb643ec4ca0 (unknown)
> @   0x8649a4 mesos::SlaveID::value()
> @   0x8741c7 mesos::hash_value()
> @   0x8f7448 boost::hash<>::operator()()
> @   0x8e0bed 
> boost::unordered::detail::mix64_policy<>::apply_hash<>()
> @ 0x7fb64694c1cf boost::unordered::detail::table<>::hash()
> @ 0x7fb646973615 boost::unordered::detail::table<>::find_node()
> @ 0x7fb64694c191 boost::unordered::detail::table_impl<>::count()
> @ 0x7fb64691f3c1 boost::unordered::unordered_map<>::count()
> @ 0x7fb6468f4373 hashmap<>::contains()
> @ 0x7fb6468c5eda mesos::internal::master::Master::getSlave()
> @ 0x7fb6468c0fc3 mesos::internal::master::Master::removeFramework()
> @ 0x7fb6468afa9f 
> mesos::internal::master::Master::unregisterFramework()
> @ 0x7fb646904ab9 ProtobufProcess<>::handler1<>()
> @ 0x7fb6469a1e81 
> _ZNSt5_BindIFPFvPN5mesos8internal6master6MasterEMS3_FvRKN7process4UPIDERKNS0_11FrameworkIDEEMNS1_26UnregisterFrameworkMessageEKFSB_vES8_RKSsES4_SD_SG_St12_PlaceholderILi1EESL_ILi26__callIvJS8_SI_EJLm0ELm1ELm2ELm3ELm4T_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
> @ 0x7fb646983afe std::_Bind<>::operator()<>()
> @ 0x7fb64695f83c std::_Function_handler<>::_M_invoke()
> @   0xc4e17f std::function<>::operator()()
> @ 0x7fb6468ebd10 ProtobufProcess<>::visit()
> @ 0x7fb6468a9892 mesos::internal::master::Master::_visit()
> @ 0x7fb6468a8f46 mesos::internal::master::Master::visit()
> @ 0x7fb6468ce670 process::MessageEvent::visit()
> @   0x86ad54 process::ProcessBase::serve()
> @ 0x7fb6470e9738 process::ProcessManager::resume()
> @ 0x7fb6470dff3f process::schedule()
> @ 0x7fb643ebc83d start_thread
> @ 0x7fb642c2426d clone
> make[3]: *** [check-local] Segmentation fault
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1818) AllocatorTest/0.ResourcesUnused sometimes segfaults

2014-09-18 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-1818:
--
Assignee: Benjamin Mahler  (was: Vinod Kone)

> AllocatorTest/0.ResourcesUnused sometimes segfaults
> ---
>
> Key: MESOS-1818
> URL: https://issues.apache.org/jira/browse/MESOS-1818
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.21.0
>Reporter: Vinod Kone
>Assignee: Benjamin Mahler
>Priority: Critical
>
> {code}
> [ RUN  ] AllocatorTest/0.ResourcesUnused
> *** Aborted at 1411088950 (unix time) try "date -d @1411088950" if you are 
> using GNU date ***
> PC: @   0x8649a4 mesos::SlaveID::value()
> *** SIGSEGV (@0x2de9) received by PID 20876 (TID 0x7fb63a1c0940) from PID 
> 11753; stack trace: ***
> @ 0x7fb643ec4ca0 (unknown)
> @   0x8649a4 mesos::SlaveID::value()
> @   0x8741c7 mesos::hash_value()
> @   0x8f7448 boost::hash<>::operator()()
> @   0x8e0bed 
> boost::unordered::detail::mix64_policy<>::apply_hash<>()
> @ 0x7fb64694c1cf boost::unordered::detail::table<>::hash()
> @ 0x7fb646973615 boost::unordered::detail::table<>::find_node()
> @ 0x7fb64694c191 boost::unordered::detail::table_impl<>::count()
> @ 0x7fb64691f3c1 boost::unordered::unordered_map<>::count()
> @ 0x7fb6468f4373 hashmap<>::contains()
> @ 0x7fb6468c5eda mesos::internal::master::Master::getSlave()
> @ 0x7fb6468c0fc3 mesos::internal::master::Master::removeFramework()
> @ 0x7fb6468afa9f 
> mesos::internal::master::Master::unregisterFramework()
> @ 0x7fb646904ab9 ProtobufProcess<>::handler1<>()
> @ 0x7fb6469a1e81 
> _ZNSt5_BindIFPFvPN5mesos8internal6master6MasterEMS3_FvRKN7process4UPIDERKNS0_11FrameworkIDEEMNS1_26UnregisterFrameworkMessageEKFSB_vES8_RKSsES4_SD_SG_St12_PlaceholderILi1EESL_ILi26__callIvJS8_SI_EJLm0ELm1ELm2ELm3ELm4T_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
> @ 0x7fb646983afe std::_Bind<>::operator()<>()
> @ 0x7fb64695f83c std::_Function_handler<>::_M_invoke()
> @   0xc4e17f std::function<>::operator()()
> @ 0x7fb6468ebd10 ProtobufProcess<>::visit()
> @ 0x7fb6468a9892 mesos::internal::master::Master::_visit()
> @ 0x7fb6468a8f46 mesos::internal::master::Master::visit()
> @ 0x7fb6468ce670 process::MessageEvent::visit()
> @   0x86ad54 process::ProcessBase::serve()
> @ 0x7fb6470e9738 process::ProcessManager::resume()
> @ 0x7fb6470dff3f process::schedule()
> @ 0x7fb643ebc83d start_thread
> @ 0x7fb642c2426d clone
> make[3]: *** [check-local] Segmentation fault
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1813) Fail fast in example frameworks if task goes into unexpected state

2014-09-18 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-1813:
--
Sprint: Mesos Q3 Sprint 5

> Fail fast in example frameworks if task goes into unexpected state
> --
>
> Key: MESOS-1813
> URL: https://issues.apache.org/jira/browse/MESOS-1813
> Project: Mesos
>  Issue Type: Improvement
>  Components: test
>Reporter: Vinod Kone
>Assignee: Vinod Kone
> Fix For: 0.21.0
>
>
> Most of the example frameworks launch a bunch of tasks and exit if *all* of 
> them reach the FINISHED state. But if there is a bug in the code resulting in 
> TASK_LOST, the framework waits forever. Instead, the framework should abort if 
> an unexpected task state is encountered.
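
A standalone sketch of the fail-fast behavior being asked for. The types and
the exit calls below are stand-ins, not the Mesos scheduler API; the real
change would live in each example framework's status update callback:

{code}
#include <cstdlib>
#include <iostream>

// Stand-in for the task states an example framework cares about.
enum class TaskState { RUNNING, FINISHED, FAILED, KILLED, LOST };

struct ExampleScheduler
{
  int totalTasks = 5;
  int tasksFinished = 0;

  // Called for every status update: finish normally once every task has
  // reached FINISHED, but abort immediately on any other terminal state
  // instead of waiting forever.
  void statusUpdate(TaskState state)
  {
    if (state == TaskState::FINISHED) {
      if (++tasksFinished == totalTasks) {
        std::cout << "All tasks finished, exiting" << std::endl;
        std::exit(0);
      }
    } else if (state != TaskState::RUNNING) {
      std::cerr << "Unexpected terminal task state, aborting" << std::endl;
      std::exit(1);
    }
  }
};

int main()
{
  ExampleScheduler scheduler;
  scheduler.statusUpdate(TaskState::RUNNING);
  scheduler.statusUpdate(TaskState::FINISHED);
  scheduler.statusUpdate(TaskState::LOST);  // fails fast here
  return 0;
}
{code}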



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1814) Task attempted to use more offers than requested in example java and python frameworks

2014-09-18 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-1814:
--
Shepherd: Benjamin Mahler  (was: Yan Xu)

> Task attempted to use more offers than requested in example java and python 
> frameworks
> --
>
> Key: MESOS-1814
> URL: https://issues.apache.org/jira/browse/MESOS-1814
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.21.0
>Reporter: Vinod Kone
>Assignee: Vinod Kone
>
> {code}
> [ RUN  ] ExamplesTest.JavaFramework
> Using temporary directory '/tmp/ExamplesTest_JavaFramework_2PcFCh'
> Enabling authentication for the framework
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0917 23:14:35.199069 31510 process.cpp:1771] libprocess is initialized on 
> 127.0.1.1:34609 for 8 cpus
> I0917 23:14:35.199794 31510 logging.cpp:177] Logging to STDERR
> I0917 23:14:35.225342 31510 leveldb.cpp:176] Opened db in 22.197149ms
> I0917 23:14:35.231133 31510 leveldb.cpp:183] Compacted db in 5.601897ms
> I0917 23:14:35.231498 31510 leveldb.cpp:198] Created db iterator in 215441ns
> I0917 23:14:35.231608 31510 leveldb.cpp:204] Seeked to beginning of db in 
> 11488ns
> I0917 23:14:35.231722 31510 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 14016ns
> I0917 23:14:35.231917 31510 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0917 23:14:35.233129 31526 recover.cpp:425] Starting replica recovery
> I0917 23:14:35.233614 31526 recover.cpp:451] Replica is in EMPTY status
> I0917 23:14:35.234994 31526 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0917 23:14:35.240116 31519 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0917 23:14:35.240782 31519 recover.cpp:542] Updating replica status to 
> STARTING
> I0917 23:14:35.242846 31524 master.cpp:286] Master 
> 20140917-231435-16842879-34609-31503 (saucy) started on 127.0.1.1:34609
> I0917 23:14:35.243191 31524 master.cpp:332] Master only allowing 
> authenticated frameworks to register
> I0917 23:14:35.243288 31524 master.cpp:339] Master allowing unauthenticated 
> slaves to register
> I0917 23:14:35.243399 31524 credentials.hpp:36] Loading credentials for 
> authentication from '/tmp/ExamplesTest_JavaFramework_2PcFCh/credentials'
> W0917 23:14:35.243588 31524 credentials.hpp:51] Permissions on credentials 
> file '/tmp/ExamplesTest_JavaFramework_2PcFCh/credentials' are too open. It is 
> recommended that your credentials file is NOT accessible by others.
> I0917 23:14:35.243846 31524 master.cpp:366] Authorization enabled
> I0917 23:14:35.244882 31520 hierarchical_allocator_process.hpp:299] 
> Initializing hierarchical allocator process with master : 
> master@127.0.1.1:34609
> I0917 23:14:35.245224 31520 master.cpp:120] No whitelist given. Advertising 
> offers for all slaves
> I0917 23:14:35.246934 31524 master.cpp:1211] The newly elected leader is 
> master@127.0.1.1:34609 with id 20140917-231435-16842879-34609-31503
> I0917 23:14:35.247234 31524 master.cpp:1224] Elected as the leading master!
> I0917 23:14:35.247336 31524 master.cpp:1042] Recovering from registrar
> I0917 23:14:35.247542 31526 registrar.cpp:313] Recovering registrar
> I0917 23:14:35.250555 31510 containerizer.cpp:89] Using isolation: 
> posix/cpu,posix/mem
> I0917 23:14:35.252326 31510 containerizer.cpp:89] Using isolation: 
> posix/cpu,posix/mem
> I0917 23:14:35.252821 31520 slave.cpp:169] Slave started on 1)@127.0.1.1:34609
> I0917 23:14:35.253552 31520 slave.cpp:289] Slave resources: cpus(*):1; 
> mem(*):1001; disk(*):24988; ports(*):[31000-32000]
> I0917 23:14:35.253906 31520 slave.cpp:317] Slave hostname: saucy
> I0917 23:14:35.254004 31520 slave.cpp:318] Slave checkpoint: true
> I0917 23:14:35.254818 31520 state.cpp:33] Recovering state from 
> '/tmp/mesos-w8snRW/0/meta'
> I0917 23:14:35.255106 31519 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 13.99622ms
> I0917 23:14:35.255235 31519 replica.cpp:320] Persisted replica status to 
> STARTING
> I0917 23:14:35.255419 31519 recover.cpp:451] Replica is in STARTING status
> I0917 23:14:35.255834 31519 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0917 23:14:35.256000 31519 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0917 23:14:35.256217 31519 recover.cpp:542] Updating replica status to VOTING
> I0917 23:14:35.256641 31520 status_update_manager.cpp:193] Recovering status 
> update manager
> I0917 23:14:35.257064 31520 containerizer.cpp:252] Recovering containerizer
> I0917 23:14:35.257725 31520 slave.cpp:3220] Finished recovery
> I0917 23:14:35.258463 31520 slave.cpp:600] New ma

[jira] [Commented] (MESOS-1384) Add support for loadable MesosModule

2014-09-18 Thread Bernd Mathiske (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139841#comment-14139841
 ] 

Bernd Mathiske commented on MESOS-1384:
---

OK, JSON it is then. 

> Add support for loadable MesosModule
> 
>
> Key: MESOS-1384
> URL: https://issues.apache.org/jira/browse/MESOS-1384
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 0.19.0
>Reporter: Timothy St. Clair
>Assignee: Niklas Quarfot Nielsen
>
> I think we should break this into multiple phases.
> -(1) Let's get the dynamic library loading via a "stout-ified" version of 
> https://github.com/timothysc/tests/blob/master/plugin_modules/DynamicLibrary.h.
>  -
> *DONE*
> (2) Use (1) to instantiate some classes in Mesos (like an Authenticator 
> and/or isolator) from a dynamic library. This will give us some more 
> experience with how we want to name the underlying library symbol, how we 
> want to specify flags for finding the library, what types of validation we 
> want when loading a library.
> *TARGET* 
> (3) After doing (2) for one or two classes in Mesos I think we can formalize 
> the approach in a "mesos-ified" version of 
> https://github.com/timothysc/tests/blob/master/plugin_modules/MesosModule.h.
> *NEXT*
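
A minimal standalone sketch of the phase (2) idea of instantiating a class
from a dynamic library. This is not the Mesos/stout DynamicLibrary code, and
the factory symbol name "createAuthenticator" is a hypothetical convention:

{code}
// Build with: g++ -std=c++11 load_example.cpp -ldl
#include <dlfcn.h>

#include <iostream>
#include <string>

class Authenticator;  // opaque here; defined by the loaded module

typedef Authenticator* (*CreateFn)();

Authenticator* loadAuthenticator(const std::string& path)
{
  // Open the shared library provided on the command line.
  void* handle = dlopen(path.c_str(), RTLD_LAZY);
  if (handle == nullptr) {
    std::cerr << "dlopen failed: " << dlerror() << std::endl;
    return nullptr;
  }

  // Look up the well-known factory symbol exported by the module.
  void* symbol = dlsym(handle, "createAuthenticator");
  if (symbol == nullptr) {
    std::cerr << "dlsym failed: " << dlerror() << std::endl;
    dlclose(handle);
    return nullptr;
  }

  return reinterpret_cast<CreateFn>(symbol)();
}

int main(int argc, char** argv)
{
  if (argc < 2) {
    std::cerr << "usage: " << argv[0] << " /path/to/libmodule.so" << std::endl;
    return 1;
  }

  Authenticator* authenticator = loadAuthenticator(argv[1]);
  std::cout << (authenticator != nullptr ? "loaded" : "not loaded") << std::endl;
  return 0;
}
{code}

How the symbol is named, how the library path is passed via flags, and what
validation happens on load are exactly the open questions listed in (2).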



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-1818) AllocatorTest/0.ResourcesUnused sometimes segfaults

2014-09-18 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-1818:
-

 Summary: AllocatorTest/0.ResourcesUnused sometimes segfaults
 Key: MESOS-1818
 URL: https://issues.apache.org/jira/browse/MESOS-1818
 Project: Mesos
  Issue Type: Bug
  Components: test
Affects Versions: 0.21.0
Reporter: Vinod Kone
Priority: Critical


{code}
[ RUN  ] AllocatorTest/0.ResourcesUnused
*** Aborted at 1411088950 (unix time) try "date -d @1411088950" if you are 
using GNU date ***
PC: @   0x8649a4 mesos::SlaveID::value()
*** SIGSEGV (@0x2de9) received by PID 20876 (TID 0x7fb63a1c0940) from PID 
11753; stack trace: ***
@ 0x7fb643ec4ca0 (unknown)
@   0x8649a4 mesos::SlaveID::value()
@   0x8741c7 mesos::hash_value()
@   0x8f7448 boost::hash<>::operator()()
@   0x8e0bed 
boost::unordered::detail::mix64_policy<>::apply_hash<>()
@ 0x7fb64694c1cf boost::unordered::detail::table<>::hash()
@ 0x7fb646973615 boost::unordered::detail::table<>::find_node()
@ 0x7fb64694c191 boost::unordered::detail::table_impl<>::count()
@ 0x7fb64691f3c1 boost::unordered::unordered_map<>::count()
@ 0x7fb6468f4373 hashmap<>::contains()
@ 0x7fb6468c5eda mesos::internal::master::Master::getSlave()
@ 0x7fb6468c0fc3 mesos::internal::master::Master::removeFramework()
@ 0x7fb6468afa9f mesos::internal::master::Master::unregisterFramework()
@ 0x7fb646904ab9 ProtobufProcess<>::handler1<>()
@ 0x7fb6469a1e81 
_ZNSt5_BindIFPFvPN5mesos8internal6master6MasterEMS3_FvRKN7process4UPIDERKNS0_11FrameworkIDEEMNS1_26UnregisterFrameworkMessageEKFSB_vES8_RKSsES4_SD_SG_St12_PlaceholderILi1EESL_ILi26__callIvJS8_SI_EJLm0ELm1ELm2ELm3ELm4T_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
@ 0x7fb646983afe std::_Bind<>::operator()<>()
@ 0x7fb64695f83c std::_Function_handler<>::_M_invoke()
@   0xc4e17f std::function<>::operator()()
@ 0x7fb6468ebd10 ProtobufProcess<>::visit()
@ 0x7fb6468a9892 mesos::internal::master::Master::_visit()
@ 0x7fb6468a8f46 mesos::internal::master::Master::visit()
@ 0x7fb6468ce670 process::MessageEvent::visit()
@   0x86ad54 process::ProcessBase::serve()
@ 0x7fb6470e9738 process::ProcessManager::resume()
@ 0x7fb6470dff3f process::schedule()
@ 0x7fb643ebc83d start_thread
@ 0x7fb642c2426d clone
make[3]: *** [check-local] Segmentation fault
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1817) Completed tasks remains in TASK_RUNNING when framework is disconnected

2014-09-18 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-1817:
--
Description: 
We have run into a problem that causes tasks which complete while a framework 
is disconnected (and has a failover time set) to remain in a running state even 
though the tasks actually finish. This hogs the cluster and gives users an 
inconsistent view of the cluster state. Going to the slave, the task is 
finished. Going to the master, the task is still in a non-terminal state. When 
the scheduler reattaches or the failover timeout expires, the tasks finish 
correctly. The current workflow of this scheduler has a long failover timeout, 
but may, on the other hand, never reattach.

Here is a test framework we have been able to reproduce the issue with: 
https://gist.github.com/nqn/9b9b1de9123a6e836f54
It launches many short-lived tasks (a 1-second sleep) and, after the framework 
instance is killed, the master reports the tasks as running even after several 
minutes: 
http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png

When clicking on one of the slaves where, for example, task 49 ran, the slave 
knows that it completed: 
http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png

Here is the log of a mesos-local instance where I reproduced it: 
https://gist.github.com/nqn/f7ee20601199d70787c0 (here tasks 10 to 19 are stuck 
in the running state).
There is a lot of output, so here is a filtered log for task 10: 
https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d

The problem turns out to be an issue with the ack-cycle of status updates:
If the framework disconnects (with a failover timeout set), the status update 
manager on the slaves will keep trying to send the front of the status update 
stream to the master (which in turn forwards it to the framework). If the first 
status update after the disconnect is terminal, things work out fine; the 
master picks the terminal state up, removes the task, and releases the resources.
If, on the other hand, a non-terminal status is in the stream, the master 
will never know that the task finished (or failed) before the framework 
reconnects.

During a discussion on the dev mailing list 
(http://mail-archives.apache.org/mod_mbox/mesos-dev/201409.mbox/%3cCADKthhAVR5mrq1s9HXw1BB_XFALXWWxjutp7MV4y3wP-Bh=a...@mail.gmail.com%3e)
 we enumerated a couple of options to solve this problem.

First off, having two ack-cycles: one between masters and slaves and one 
between masters and frameworks, would be ideal. We would be able to replay the 
statuses in order while keeping the master state current. However, this 
requires us to persist the master state in replicated storage.

As a first pass, we can make sure that tasks caught in a running state don't 
hog the cluster once they have completed while the framework is disconnected.

Here is a proof-of-concept to work out of: 
https://github.com/nqn/mesos/tree/niklas/status-update-disconnect/

A new (optional) field has been added to the internal status update message:
https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/messages/messages.proto#L68

This makes it possible for the status update manager to set the field if the 
latest status was terminal: 
https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/slave/status_update_manager.cpp#L501

I added a test which should highlight the issue as well:
https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/tests/fault_tolerance_tests.cpp#L2478

I would love some input on the approach before moving on.
There are rough edges in the PoC which (of course) should be addressed before 
bringing it up for review.

  was:
We have run into a problem that cause tasks which completes, when a framework 
is disconnected and has a fail-over time, to remain in a running state even 
though the tasks actually finishes. This hogs the cluster and gives users a 
inconsistent view of the cluster state. Going to the slave, the task is 
finished. Going to the master, the task is still in a non-terminal state. When 
the scheduler reattaches or the failover timeout expires, the tasks finishes 
correctly. The current workflow of this scheduler has a long fail-over timeout, 
but may on the other hand never reattach.

Here is a test framework we have been able to reproduce the issue with: 
https://gist.github.com/nqn/9b9b1de9123a6e836f54
It launches many short-lived tasks (1 second sleep) and when killing the 
framework instance, the master reports the tasks as running even after several 
minutes: 
http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png

When clicking on one of the slaves where, for example, task 49 runs; the slave 
knows that it completed: 
http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png

Here is the log of a mesos-loc

[jira] [Created] (MESOS-1817) Completed tasks remains in TASK_RUNNING when framework is disconnected

2014-09-18 Thread Niklas Quarfot Nielsen (JIRA)
Niklas Quarfot Nielsen created MESOS-1817:
-

 Summary: Completed tasks remains in TASK_RUNNING when framework is 
disconnected
 Key: MESOS-1817
 URL: https://issues.apache.org/jira/browse/MESOS-1817
 Project: Mesos
  Issue Type: Bug
Reporter: Niklas Quarfot Nielsen


We have run into a problem that causes tasks which complete while a framework 
is disconnected (and has a failover time set) to remain in a running state even 
though the tasks actually finish. This hogs the cluster and gives users an 
inconsistent view of the cluster state. Going to the slave, the task is 
finished. Going to the master, the task is still in a non-terminal state. When 
the scheduler reattaches or the failover timeout expires, the tasks finish 
correctly. The current workflow of this scheduler has a long failover timeout, 
but may, on the other hand, never reattach.

Here is a test framework we have been able to reproduce the issue with: 
https://gist.github.com/nqn/9b9b1de9123a6e836f54
It launches many short-lived tasks (a 1-second sleep) and, after the framework 
instance is killed, the master reports the tasks as running even after several 
minutes: 
http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png

When clicking on one of the slaves where, for example, task 49 ran, the slave 
knows that it completed: 
http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png

Here is the log of a mesos-local instance where I reproduced it: 
https://gist.github.com/nqn/f7ee20601199d70787c0 (here tasks 10 to 19 are stuck 
in the running state).
There is a lot of output, so here is a filtered log for task 10: 
https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d

The problem turns out to be an issue with the ack-cycle of status updates:
If the framework disconnects (with a failover timeout set), the status update 
manager on the slaves will keep trying to send the front of the status update 
stream to the master (which in turn forwards it to the framework). If the first 
status update after the disconnect is terminal, things work out fine; the 
master picks the terminal state up, removes the task, and releases the resources.
If, on the other hand, a non-terminal status is in the stream, the master 
will never know that the task finished (or failed) before the framework 
reconnects.

During a discussion on the dev mailing list 
(http://mail-archives.apache.org/mod_mbox/mesos-dev/201409.mbox/%3cCADKthhAVR5mrq1s9HXw1BB_XFALXWWxjutp7MV4y3wP-Bh=a...@mail.gmail.com%3e)
 we enumerated a couple of options to solve this problem.

First off, having two ack-cycles: one between masters and slaves and one 
between masters and frameworks, would be ideal. We would be able to replay the 
statuses in order while keeping the master state current. However, this 
requires us to persist the master state in replicated storage.

As a first pass, we can make sure that tasks caught in a running state don't 
hog the cluster once they have completed while the framework is disconnected.

Here is a proof-of-concept to work out of: 
https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/

A new (optional) field has been added to the internal status update message:
https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/messages/messages.proto#L68

This makes it possible for the status update manager to set the field if the 
latest status was terminal: 
https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/slave/status_update_manager.cpp#L501

I added a test which should highlight the issue as well:
https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/tests/fault_tolerance_tests.cpp#L2478

I would love some input on the approach before moving on.
There are rough edges in the PoC which (of course) should be addressed before 
bringing it up for review.
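
A minimal standalone model of the proposed hint, not the PoC code linked
above: the slave keeps an ordered stream of unacknowledged updates and keeps
re-sending the front one, and the extra flag tells the master whether a
terminal status already sits later in the stream, so it can stop accounting
the task's resources without breaking in-order delivery:

{code}
#include <deque>
#include <iostream>
#include <string>

enum class TaskState { RUNNING, FINISHED, FAILED };

bool isTerminal(TaskState state)
{
  return state == TaskState::FINISHED || state == TaskState::FAILED;
}

struct StatusUpdate
{
  std::string taskId;
  TaskState state;
};

// What the slave would forward: the front update plus the hint.
struct ForwardedUpdate
{
  StatusUpdate update;
  bool latestStateIsTerminal;  // the new (optional) field in the PoC
};

ForwardedUpdate forwardFront(const std::deque<StatusUpdate>& stream)
{
  bool terminalSeen = false;
  for (const StatusUpdate& update : stream) {
    terminalSeen = terminalSeen || isTerminal(update.state);
  }
  return ForwardedUpdate{stream.front(), terminalSeen};
}

int main()
{
  // The task ran and finished, but the RUNNING update was never acknowledged
  // because the framework disconnected before it could be delivered.
  std::deque<StatusUpdate> stream = {
    {"task-10", TaskState::RUNNING},
    {"task-10", TaskState::FINISHED},
  };

  ForwardedUpdate forwarded = forwardFront(stream);
  std::cout << "forwarding " << forwarded.update.taskId
            << ", terminal already reached: " << std::boolalpha
            << forwarded.latestStateIsTerminal << std::endl;
  return 0;
}
{code}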



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-1813) Fail fast in example frameworks if task goes into unexpected state

2014-09-18 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-1813:
-

Assignee: Vinod Kone

> Fail fast in example frameworks if task goes into unexpected state
> --
>
> Key: MESOS-1813
> URL: https://issues.apache.org/jira/browse/MESOS-1813
> Project: Mesos
>  Issue Type: Improvement
>  Components: test
>Reporter: Vinod Kone
>Assignee: Vinod Kone
>
> Most of the example frameworks launch a bunch of tasks and exit if *all* of 
> them reach the FINISHED state. But if there is a bug in the code resulting in 
> TASK_LOST, the framework waits forever. Instead, the framework should abort if 
> an unexpected task state is encountered.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1813) Fail fast in example frameworks if task goes into unexpected state

2014-09-18 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139777#comment-14139777
 ] 

Vinod Kone commented on MESOS-1813:
---

https://reviews.apache.org/r/25805/

> Fail fast in example frameworks if task goes into unexpected state
> --
>
> Key: MESOS-1813
> URL: https://issues.apache.org/jira/browse/MESOS-1813
> Project: Mesos
>  Issue Type: Improvement
>  Components: test
>Reporter: Vinod Kone
>Assignee: Vinod Kone
>
> Most of the example frameworks launch a bunch of tasks and exit if *all* of 
> them reach the FINISHED state. But if there is a bug in the code resulting in 
> TASK_LOST, the framework waits forever. Instead, the framework should abort if 
> an unexpected task state is encountered.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1814) Task attempted to use more offers than requested in example java and python frameworks

2014-09-18 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139686#comment-14139686
 ] 

Vinod Kone commented on MESOS-1814:
---

https://reviews.apache.org/r/25801/

> Task attempted to use more offers than requested in example java and python 
> frameworks
> --
>
> Key: MESOS-1814
> URL: https://issues.apache.org/jira/browse/MESOS-1814
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.21.0
>Reporter: Vinod Kone
>Assignee: Vinod Kone
>
> {code}
> [ RUN  ] ExamplesTest.JavaFramework
> Using temporary directory '/tmp/ExamplesTest_JavaFramework_2PcFCh'
> Enabling authentication for the framework
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0917 23:14:35.199069 31510 process.cpp:1771] libprocess is initialized on 
> 127.0.1.1:34609 for 8 cpus
> I0917 23:14:35.199794 31510 logging.cpp:177] Logging to STDERR
> I0917 23:14:35.225342 31510 leveldb.cpp:176] Opened db in 22.197149ms
> I0917 23:14:35.231133 31510 leveldb.cpp:183] Compacted db in 5.601897ms
> I0917 23:14:35.231498 31510 leveldb.cpp:198] Created db iterator in 215441ns
> I0917 23:14:35.231608 31510 leveldb.cpp:204] Seeked to beginning of db in 
> 11488ns
> I0917 23:14:35.231722 31510 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 14016ns
> I0917 23:14:35.231917 31510 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0917 23:14:35.233129 31526 recover.cpp:425] Starting replica recovery
> I0917 23:14:35.233614 31526 recover.cpp:451] Replica is in EMPTY status
> I0917 23:14:35.234994 31526 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0917 23:14:35.240116 31519 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0917 23:14:35.240782 31519 recover.cpp:542] Updating replica status to 
> STARTING
> I0917 23:14:35.242846 31524 master.cpp:286] Master 
> 20140917-231435-16842879-34609-31503 (saucy) started on 127.0.1.1:34609
> I0917 23:14:35.243191 31524 master.cpp:332] Master only allowing 
> authenticated frameworks to register
> I0917 23:14:35.243288 31524 master.cpp:339] Master allowing unauthenticated 
> slaves to register
> I0917 23:14:35.243399 31524 credentials.hpp:36] Loading credentials for 
> authentication from '/tmp/ExamplesTest_JavaFramework_2PcFCh/credentials'
> W0917 23:14:35.243588 31524 credentials.hpp:51] Permissions on credentials 
> file '/tmp/ExamplesTest_JavaFramework_2PcFCh/credentials' are too open. It is 
> recommended that your credentials file is NOT accessible by others.
> I0917 23:14:35.243846 31524 master.cpp:366] Authorization enabled
> I0917 23:14:35.244882 31520 hierarchical_allocator_process.hpp:299] 
> Initializing hierarchical allocator process with master : 
> master@127.0.1.1:34609
> I0917 23:14:35.245224 31520 master.cpp:120] No whitelist given. Advertising 
> offers for all slaves
> I0917 23:14:35.246934 31524 master.cpp:1211] The newly elected leader is 
> master@127.0.1.1:34609 with id 20140917-231435-16842879-34609-31503
> I0917 23:14:35.247234 31524 master.cpp:1224] Elected as the leading master!
> I0917 23:14:35.247336 31524 master.cpp:1042] Recovering from registrar
> I0917 23:14:35.247542 31526 registrar.cpp:313] Recovering registrar
> I0917 23:14:35.250555 31510 containerizer.cpp:89] Using isolation: 
> posix/cpu,posix/mem
> I0917 23:14:35.252326 31510 containerizer.cpp:89] Using isolation: 
> posix/cpu,posix/mem
> I0917 23:14:35.252821 31520 slave.cpp:169] Slave started on 1)@127.0.1.1:34609
> I0917 23:14:35.253552 31520 slave.cpp:289] Slave resources: cpus(*):1; 
> mem(*):1001; disk(*):24988; ports(*):[31000-32000]
> I0917 23:14:35.253906 31520 slave.cpp:317] Slave hostname: saucy
> I0917 23:14:35.254004 31520 slave.cpp:318] Slave checkpoint: true
> I0917 23:14:35.254818 31520 state.cpp:33] Recovering state from 
> '/tmp/mesos-w8snRW/0/meta'
> I0917 23:14:35.255106 31519 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 13.99622ms
> I0917 23:14:35.255235 31519 replica.cpp:320] Persisted replica status to 
> STARTING
> I0917 23:14:35.255419 31519 recover.cpp:451] Replica is in STARTING status
> I0917 23:14:35.255834 31519 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0917 23:14:35.256000 31519 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0917 23:14:35.256217 31519 recover.cpp:542] Updating replica status to VOTING
> I0917 23:14:35.256641 31520 status_update_manager.cpp:193] Recovering status 
> update manager
> I0917 23:14:35.257064 31520 containerizer.cpp:252] Recovering containerizer
> I0917 23:14:35.257725 31520 slave.cpp:3220] Finished recovery
> 

[jira] [Updated] (MESOS-1816) lxc execution driver support for docker containerizer

2014-09-18 Thread Eugen Feller (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugen Feller updated MESOS-1816:

Description: 
Hi all,

One way to get networking up and running in Docker is to use the bridge mode. 
The bridge mode results in Docker automatically assigning IPs to the containers 
from the IP range specified on the docker0 bridge.

In our setup we need to manage IPs using our own DHCP server. Unfortunately 
this is not supported by Docker's libcontainer execution driver. Instead, the 
lxc execution driver 
(http://blog.docker.com/2014/03/docker-0-9-introducing-execution-drivers-and-libcontainer/)
can be used. In order to use the lxc execution driver, the Docker daemon needs 
to be started with the "-e lxc" flag. Once started, Docker's own networking can 
be disabled and lxc options can be passed to the docker run command. For example:

$ docker run -n=false --lxc-conf="lxc.network.type = veth" 
--lxc-conf="lxc.network.link = br0" --lxc-conf="lxc.network.name = eth0" 
--lxc-conf="lxc.network.flags = up" ...

This will force Docker to use my own bridge br0. Moreover, an IP can be assigned 
to the eth0 interface by executing the "dhclient eth0" command inside the 
started container.

In the previous integration of Docker in Mesos (using Deimos), I have passed 
the aforementioned options using the "options" flag in Marathon. However, with 
the new changes this is no longer possible. It would be great to support the 
lxc execution driver in the current Docker integration.

Thanks.

Best regards,
Eugen

  was:
Hi all,

One way to get networking up and running in Docker is to use the bridge mode. 
The bridge mode results in Docker automatically assigning IPs to the containers 
from the IP range specified on the docker0 bridge.

In our setup we need to manage IPs using our own DHCP server. Unfortunately 
this is not supported by Docker's libcontainer execution driver. Instead, the 
lxc execution driver 
(http://blog.docker.com/2014/03/docker-0-9-introducing-execution-drivers-and-libcontainer/)
 can be used. In order to use the lxc execution driver, Docker daemon needs to 
be started with the "-e lxc" flag. Once started, Docker own networking can be 
disabled and lxc options can be passed to the docker run command. For example:

$ docker run -n=false --lxc-conf="lxc.network.type = veth" 
--lxc-conf="lxc.network.link = br0" --lxc-conf="lxc.network.name = eth0" 
-lxc-conf="lxc.network.flags = up"

This will force Docker to use my own bridge br0. Moreover, IP can be assigned 
to the eth0 interface by executing the "dhclient eth0" command inside the 
started container.

In the previous integration of Docker in Mesos (using Deimos), I have passed 
the aforementioned options using the "options" flag in Marathon. However, with 
the new changes this is no longer possible. It would be great to support the 
lxc execution driver in the current Docker integration.

Thanks.

Best regards,
Eugen


> lxc execution driver support for docker containerizer
> -
>
> Key: MESOS-1816
> URL: https://issues.apache.org/jira/browse/MESOS-1816
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Affects Versions: 0.20.1
>Reporter: Eugen Feller
>  Labels: docker
>
> Hi all,
> One way to get networking up and running in Docker is to use the bridge mode. 
> The bridge mode results in Docker automatically assigning IPs to the 
> containers from the IP range specified on the docker0 bridge.
> In our setup we need to manage IPs using our own DHCP server. Unfortunately 
> this is not supported by Docker's libcontainer execution driver. Instead, the 
> lxc execution driver 
> (http://blog.docker.com/2014/03/docker-0-9-introducing-execution-drivers-and-libcontainer/)
>  can be used. In order to use the lxc execution driver, the Docker daemon 
> needs to be started with the "-e lxc" flag. Once started, Docker's own 
> networking can be disabled and lxc options can be passed to the docker run 
> command. For example:
> $ docker run -n=false --lxc-conf="lxc.network.type = veth" 
> --lxc-conf="lxc.network.link = br0" --lxc-conf="lxc.network.name = eth0" 
> --lxc-conf="lxc.network.flags = up" ...
> This will force Docker to use my own bridge br0. Moreover, an IP can be 
> assigned to the eth0 interface by executing the "dhclient eth0" command 
> inside the started container.
> In the previous integration of Docker in Mesos (using Deimos), I have passed 
> the aforementioned options using the "options" flag in Marathon. However, 
> with the new changes this is no longer possible. It would be great to support 
> the lxc execution driver in the current Docker integration.
> Thanks.
> Best regards,
> Eugen



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-1816) lxc execution driver for docker containerizer

2014-09-18 Thread Eugen Feller (JIRA)
Eugen Feller created MESOS-1816:
---

 Summary: lxc execution driver for docker containerizer
 Key: MESOS-1816
 URL: https://issues.apache.org/jira/browse/MESOS-1816
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Affects Versions: 0.20.1
Reporter: Eugen Feller


Hi all,

One way to get networking up and running in Docker is to use the bridge mode. 
The bridge mode results in Docker automatically assigning IPs to the containers 
from the IP range specified on the docker0 bridge.

In our setup we need to manage IPs using our own DHCP server. Unfortunately 
this is not supported by Docker's libcontainer execution driver. Instead, the 
lxc execution driver 
(http://blog.docker.com/2014/03/docker-0-9-introducing-execution-drivers-and-libcontainer/)
can be used. In order to use the lxc execution driver, the Docker daemon needs 
to be started with the "-e lxc" flag. Once started, Docker's own networking can 
be disabled and lxc options can be passed to the docker run command. For example:

$ docker run -n=false --lxc-conf="lxc.network.type = veth" 
--lxc-conf="lxc.network.link = br0" --lxc-conf="lxc.network.name = eth0" 
--lxc-conf="lxc.network.flags = up"

This will force Docker to use my own bridge br0. Moreover, an IP can be assigned 
to the eth0 interface by executing the "dhclient eth0" command inside the 
started container.

In the previous integration of Docker in Mesos (using Deimos), I have passed 
the aforementioned options using the "options" flag in Marathon. However, with 
the new changes this is no longer possible. It would be great to support the 
lxc execution driver in the current Docker integration.

Thanks.

Best regards,
Eugen



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1816) lxc execution driver support for docker containerizer

2014-09-18 Thread Eugen Feller (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugen Feller updated MESOS-1816:

Summary: lxc execution driver support for docker containerizer  (was: lxc 
execution driver for docker containerizer)

> lxc execution driver support for docker containerizer
> -
>
> Key: MESOS-1816
> URL: https://issues.apache.org/jira/browse/MESOS-1816
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Affects Versions: 0.20.1
>Reporter: Eugen Feller
>  Labels: docker
>
> Hi all,
> One way to get networking up and running in Docker is to use the bridge mode. 
> The bridge mode results in Docker automatically assigning IPs to the 
> containers from the IP range specified on the docker0 bridge.
> In our setup we need to manage IPs using our own DHCP server. Unfortunately 
> this is not supported by Docker's libcontainer execution driver. Instead, the 
> lxc execution driver 
> (http://blog.docker.com/2014/03/docker-0-9-introducing-execution-drivers-and-libcontainer/)
>  can be used. In order to use the lxc execution driver, the Docker daemon 
> needs to be started with the "-e lxc" flag. Once started, Docker's own 
> networking can be disabled and lxc options can be passed to the docker run 
> command. For example:
> $ docker run -n=false --lxc-conf="lxc.network.type = veth" 
> --lxc-conf="lxc.network.link = br0" --lxc-conf="lxc.network.name = eth0" 
> --lxc-conf="lxc.network.flags = up"
> This will force Docker to use my own bridge br0. Moreover, an IP can be 
> assigned to the eth0 interface by executing the "dhclient eth0" command 
> inside the started container.
> In the previous integration of Docker in Mesos (using Deimos), I have passed 
> the aforementioned options using the "options" flag in Marathon. However, 
> with the new changes this is no longer possible. It would be great to support 
> the lxc execution driver in the current Docker integration.
> Thanks.
> Best regards,
> Eugen



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1384) Add support for loadable MesosModule

2014-09-18 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139575#comment-14139575
 ] 

Niklas Quarfot Nielsen commented on MESOS-1384:
---

[~vinodkone] +1

> Add support for loadable MesosModule
> 
>
> Key: MESOS-1384
> URL: https://issues.apache.org/jira/browse/MESOS-1384
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 0.19.0
>Reporter: Timothy St. Clair
>Assignee: Niklas Quarfot Nielsen
>
> I think we should break this into multiple phases.
> -(1) Let's get the dynamic library loading via a "stout-ified" version of 
> https://github.com/timothysc/tests/blob/master/plugin_modules/DynamicLibrary.h.
>  -
> *DONE*
> (2) Use (1) to instantiate some classes in Mesos (like an Authenticator 
> and/or isolator) from a dynamic library. This will give us some more 
> experience with how we want to name the underlying library symbol, how we 
> want to specify flags for finding the library, what types of validation we 
> want when loading a library.
> *TARGET* 
> (3) After doing (2) for one or two classes in Mesos I think we can formalize 
> the approach in a "mesos-ified" version of 
> https://github.com/timothysc/tests/blob/master/plugin_modules/MesosModule.h.
> *NEXT*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1384) Add support for loadable MesosModule

2014-09-18 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139553#comment-14139553
 ] 

Vinod Kone commented on MESOS-1384:
---

Please have the flag as JSON. It's easy to maintain. Our JSON flag parser 
accepts a file with JSON or a raw JSON string.
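
A small sketch of the "file with JSON or raw JSON string" convention, assuming
a trivial heuristic; this is not the stout/Mesos flag parser:

{code}
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// If the flag value already looks like JSON, use it directly; otherwise
// treat it as a path and read the file's contents.
std::string resolveJsonFlag(const std::string& value)
{
  if (!value.empty() && (value[0] == '{' || value[0] == '[')) {
    return value;
  }

  std::ifstream file(value);
  if (!file) {
    return "";  // real code would surface an error here
  }

  std::ostringstream contents;
  contents << file.rdbuf();
  return contents.str();
}

int main()
{
  std::cout << resolveJsonFlag("{\"key\": \"value\"}") << std::endl;
  std::cout << resolveJsonFlag("/tmp/modules.json") << std::endl;  // hypothetical path
  return 0;
}
{code}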

> Add support for loadable MesosModule
> 
>
> Key: MESOS-1384
> URL: https://issues.apache.org/jira/browse/MESOS-1384
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 0.19.0
>Reporter: Timothy St. Clair
>Assignee: Niklas Quarfot Nielsen
>
> I think we should break this into multiple phases.
> -(1) Let's get the dynamic library loading via a "stout-ified" version of 
> https://github.com/timothysc/tests/blob/master/plugin_modules/DynamicLibrary.h.
>  -
> *DONE*
> (2) Use (1) to instantiate some classes in Mesos (like an Authenticator 
> and/or isolator) from a dynamic library. This will give us some more 
> experience with how we want to name the underlying library symbol, how we 
> want to specify flags for finding the library, what types of validation we 
> want when loading a library.
> *TARGET* 
> (3) After doing (2) for one or two classes in Mesos I think we can formalize 
> the approach in a "mesos-ified" version of 
> https://github.com/timothysc/tests/blob/master/plugin_modules/MesosModule.h.
> *NEXT*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1809) Modify docker pull to use docker inspect after a successful pull

2014-09-18 Thread Timothy Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139532#comment-14139532
 ] 

Timothy Chen commented on MESOS-1809:
-

commit 48db9a513fac0066c8f38aa98b8d893fdf298998
Author: Timothy Chen 
Date:   Thu Sep 18 02:11:40 2014 -0700

Modify Docker::pull to call inspect after pull.

Review: https://reviews.apache.org/r/25758


> Modify docker pull to use docker inspect after a successful pull
> 
>
> Key: MESOS-1809
> URL: https://issues.apache.org/jira/browse/MESOS-1809
> Project: Mesos
>  Issue Type: Bug
>Reporter: Timothy Chen
>Assignee: Timothy Chen
> Fix For: 0.20.1
>
>
> Currently, in docker pull, we read the stdout of the pull to construct the 
> Docker image object; however, that stdout contains extra output.
> We should run docker inspect after the pull instead.
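
A standalone illustration of that flow (pull, then trust the JSON from docker
inspect rather than the pull's stdout). This is not the actual Docker::pull
change referenced above, and the image name is just an example:

{code}
#include <cstdio>
#include <stdexcept>
#include <string>

// Run a shell command and capture its stdout; throw if it fails.
std::string run(const std::string& command)
{
  std::string output;
  FILE* pipe = popen(command.c_str(), "r");
  if (pipe == nullptr) {
    throw std::runtime_error("Failed to run: " + command);
  }

  char buffer[4096];
  size_t length;
  while ((length = fread(buffer, 1, sizeof(buffer), pipe)) > 0) {
    output.append(buffer, length);
  }

  if (pclose(pipe) != 0) {
    throw std::runtime_error("Command failed: " + command);
  }
  return output;
}

int main()
{
  try {
    const std::string image = "busybox";  // example image name
    run("docker pull " + image);          // progress output is ignored
    // `docker inspect` emits clean JSON describing the pulled image; the
    // image object should be constructed from this, not from pull's stdout.
    const std::string json = run("docker inspect " + image);
    std::printf("%s\n", json.c_str());
  } catch (const std::exception& error) {
    std::fprintf(stderr, "%s\n", error.what());
    return 1;
  }
  return 0;
}
{code}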



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1675) Decouple version of the mesos library from the package release version

2014-09-18 Thread Timothy St. Clair (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139531#comment-14139531
 ] 

Timothy St. Clair commented on MESOS-1675:
--

Provided that they linked to libmesos.so, I don't believe so.

> Decouple version of the mesos library from the package release version
> --
>
> Key: MESOS-1675
> URL: https://issues.apache.org/jira/browse/MESOS-1675
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>
> This discussion should be rolled into the larger discussion around how to 
> version Mesos (APIs, packages, libraries etc).
> Some notes from libtool docs.
> http://www.gnu.org/software/libtool/manual/html_node/Updating-version-info.html
> http://www.gnu.org/software/libtool/manual/html_node/Release-numbers.html#Release-numbers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1809) Modify docker pull to use docker inspect after a successful pull

2014-09-18 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-1809:
--
Fix Version/s: 0.20.1

> Modify docker pull to use docker inspect after a successful pull
> 
>
> Key: MESOS-1809
> URL: https://issues.apache.org/jira/browse/MESOS-1809
> Project: Mesos
>  Issue Type: Bug
>Reporter: Timothy Chen
>Assignee: Timothy Chen
> Fix For: 0.20.1
>
>
> Currently, in docker pull, we read the stdout of the pull to construct the 
> Docker image object; however, that stdout contains extra output.
> We should run docker inspect after the pull instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (MESOS-1809) Modify docker pull to use docker inspect after a successful pull

2014-09-18 Thread Timothy Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Chen resolved MESOS-1809.
-
Resolution: Fixed

> Modify docker pull to use docker inspect after a successful pull
> 
>
> Key: MESOS-1809
> URL: https://issues.apache.org/jira/browse/MESOS-1809
> Project: Mesos
>  Issue Type: Bug
>Reporter: Timothy Chen
>Assignee: Timothy Chen
>
> Currently, in docker pull, we read the stdout of the pull to construct the 
> Docker image object; however, that stdout contains extra output.
> We should run docker inspect after the pull instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1384) Add support for loadable MesosModule

2014-09-18 Thread Timothy St. Clair (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139412#comment-14139412
 ] 

Timothy St. Clair commented on MESOS-1384:
--

Keep it simple for now, as I fully expect this to iterate over time.  

It's also auxiliary and nothing depends on it yet, so there is room for 
refinement until that point. 

> Add support for loadable MesosModule
> 
>
> Key: MESOS-1384
> URL: https://issues.apache.org/jira/browse/MESOS-1384
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 0.19.0
>Reporter: Timothy St. Clair
>Assignee: Niklas Quarfot Nielsen
>
> I think we should break this into multiple phases.
> -(1) Let's get the dynamic library loading via a "stout-ified" version of 
> https://github.com/timothysc/tests/blob/master/plugin_modules/DynamicLibrary.h.
>  -
> *DONE*
> (2) Use (1) to instantiate some classes in Mesos (like an Authenticator 
> and/or isolator) from a dynamic library. This will give us some more 
> experience with how we want to name the underlying library symbol, how we 
> want to specify flags for finding the library, what types of validation we 
> want when loading a library.
> *TARGET* 
> (3) After doing (2) for one or two classes in Mesos I think we can formalize 
> the approach in a "mesos-ified" version of 
> https://github.com/timothysc/tests/blob/master/plugin_modules/MesosModule.h.
> *NEXT*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-1384) Add support for loadable MesosModule

2014-09-18 Thread Bernd Mathiske (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139390#comment-14139390
 ] 

Bernd Mathiske edited comment on MESOS-1384 at 9/18/14 7:57 PM:


[~tstclair] Thanks for the vote of confidence! We will can a code improvement 
pass now and also remove non-essentials to get to a minimal viable first patch. 

However, we still have to solve the question what the command line interface 
should look like. Go for JSON right away? On the command line? Or maybe this: 
keep the simple format (:,...) we have right now and 
also add a second flag that points at a JSON file? 





was (Author: bernd-mesos):
[~tstclair] Thanks for the vote of confidence! We will make a code improvement 
pass now and also remove non-essentials to get to a minimal viable first patch.

> Add support for loadable MesosModule
> 
>
> Key: MESOS-1384
> URL: https://issues.apache.org/jira/browse/MESOS-1384
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 0.19.0
>Reporter: Timothy St. Clair
>Assignee: Niklas Quarfot Nielsen
>
> I think we should break this into multiple phases.
> -(1) Let's get the dynamic library loading via a "stout-ified" version of 
> https://github.com/timothysc/tests/blob/master/plugin_modules/DynamicLibrary.h.
>  -
> *DONE*
> (2) Use (1) to instantiate some classes in Mesos (like an Authenticator 
> and/or isolator) from a dynamic library. This will give us some more 
> experience with how we want to name the underlying library symbol, how we 
> want to specify flags for finding the library, what types of validation we 
> want when loading a library.
> *TARGET* 
> (3) After doing (2) for one or two classes in Mesos I think we can formalize 
> the approach in a "mesos-ified" version of 
> https://github.com/timothysc/tests/blob/master/plugin_modules/MesosModule.h.
> *NEXT*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-1384) Add support for loadable MesosModule

2014-09-18 Thread Bernd Mathiske (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139390#comment-14139390
 ] 

Bernd Mathiske edited comment on MESOS-1384 at 9/18/14 7:57 PM:


[~tstclair] Thanks for the vote of confidence! We can do a code improvement pass 
now and also remove non-essentials to get to a minimal viable first patch. 

However, we still have to solve the question of what the command line interface 
should look like. Go for JSON right away? On the command line? Or maybe this: 
keep the simple format (:,...) we have right now and 
also add a second flag that points at a JSON file? 





was (Author: bernd-mesos):
[~tstclair] Thanks for the vote of confidence! We will can a code improvement 
pass now and also remove non-essentials to get to a minimal viable first patch. 

However, we still have to solve the question what the command line interface 
should look like. Go for JSON right away? On the command line? Or maybe this: 
keep the simple format (:,...) we have right now and 
also add a second flag that points at a JSON file? 




> Add support for loadable MesosModule
> 
>
> Key: MESOS-1384
> URL: https://issues.apache.org/jira/browse/MESOS-1384
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 0.19.0
>Reporter: Timothy St. Clair
>Assignee: Niklas Quarfot Nielsen
>
> I think we should break this into multiple phases.
> -(1) Let's get the dynamic library loading via a "stout-ified" version of 
> https://github.com/timothysc/tests/blob/master/plugin_modules/DynamicLibrary.h.
>  -
> *DONE*
> (2) Use (1) to instantiate some classes in Mesos (like an Authenticator 
> and/or isolator) from a dynamic library. This will give us some more 
> experience with how we want to name the underlying library symbol, how we 
> want to specify flags for finding the library, what types of validation we 
> want when loading a library.
> *TARGET* 
> (3) After doing (2) for one or two classes in Mesos I think we can formalize 
> the approach in a "mesos-ified" version of 
> https://github.com/timothysc/tests/blob/master/plugin_modules/MesosModule.h.
> *NEXT*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1384) Add support for loadable MesosModule

2014-09-18 Thread Bernd Mathiske (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139390#comment-14139390
 ] 

Bernd Mathiske commented on MESOS-1384:
---

[~tstclair] Thanks for the vote of confidence! We will make a code improvement 
pass now and also remove non-essentials to get to a minimal viable first patch.

> Add support for loadable MesosModule
> 
>
> Key: MESOS-1384
> URL: https://issues.apache.org/jira/browse/MESOS-1384
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 0.19.0
>Reporter: Timothy St. Clair
>Assignee: Niklas Quarfot Nielsen
>
> I think we should break this into multiple phases.
> -(1) Let's get the dynamic library loading via a "stout-ified" version of 
> https://github.com/timothysc/tests/blob/master/plugin_modules/DynamicLibrary.h.
>  -
> *DONE*
> (2) Use (1) to instantiate some classes in Mesos (like an Authenticator 
> and/or isolator) from a dynamic library. This will give us some more 
> experience with how we want to name the underlying library symbol, how we 
> want to specify flags for finding the library, what types of validation we 
> want when loading a library.
> *TARGET* 
> (3) After doing (2) for one or two classes in Mesos I think we can formalize 
> the approach in a "mesos-ified" version of 
> https://github.com/timothysc/tests/blob/master/plugin_modules/MesosModule.h.
> *NEXT*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1384) Add support for loadable MesosModule

2014-09-18 Thread Timothy St. Clair (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139349#comment-14139349
 ] 

Timothy St. Clair commented on MESOS-1384:
--

Folks - 

I think this is ready for review. 
You might want to make a couple of minor changes around named loading, e.g. 
libFoo.so vs. libFoo.dylib. 
The load could check for an extension and, in its absence, do the right thing: 
load(Foo).
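
For illustration, a minimal sketch of that extension check (not the actual 
stout/Mesos DynamicLibrary code; the helper name openLibrary and the fallback 
logic are assumptions):

{code}
// Sketch only: extension detection for dynamic loading, as suggested above.
// Not the stout/Mesos DynamicLibrary code; openLibrary is an invented name.
#include <dlfcn.h>
#include <iostream>
#include <string>

// Opens a library, appending the platform suffix when the caller passed a
// bare name such as "Foo" instead of "libFoo.so" / "libFoo.dylib".
void* openLibrary(const std::string& name)
{
  std::string path = name;

  if (path.find(".so") == std::string::npos &&
      path.find(".dylib") == std::string::npos) {
#ifdef __APPLE__
    path = "lib" + path + ".dylib";
#else
    path = "lib" + path + ".so";
#endif
  }

  return dlopen(path.c_str(), RTLD_NOW);
}

int main()
{
  if (void* handle = openLibrary("Foo")) {  // resolves to libFoo.so or .dylib
    std::cout << "loaded" << std::endl;
    dlclose(handle);
  } else {
    std::cerr << dlerror() << std::endl;
  }
  return 0;
}
{code}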

> Add support for loadable MesosModule
> 
>
> Key: MESOS-1384
> URL: https://issues.apache.org/jira/browse/MESOS-1384
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 0.19.0
>Reporter: Timothy St. Clair
>Assignee: Niklas Quarfot Nielsen
>
> I think we should break this into multiple phases.
> -(1) Let's get the dynamic library loading via a "stout-ified" version of 
> https://github.com/timothysc/tests/blob/master/plugin_modules/DynamicLibrary.h.
>  -
> *DONE*
> (2) Use (1) to instantiate some classes in Mesos (like an Authenticator 
> and/or isolator) from a dynamic library. This will give us some more 
> experience with how we want to name the underlying library symbol, how we 
> want to specify flags for finding the library, what types of validation we 
> want when loading a library.
> *TARGET* 
> (3) After doing (2) for one or two classes in Mesos I think we can formalize 
> the approach in a "mesos-ified" version of 
> https://github.com/timothysc/tests/blob/master/plugin_modules/MesosModule.h.
> *NEXT*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1808) expose RTT in container stats

2014-09-18 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-1808:
--
Assignee: Chi Zhang  (was: Jie Yu)

> expose RTT in container stats
> -
>
> Key: MESOS-1808
> URL: https://issues.apache.org/jira/browse/MESOS-1808
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Dominic Hamon
>Assignee: Chi Zhang
>
> As we expose bandwidth, we should also expose RTT as a measure of the 
> latency each container is experiencing.
> We can use {{ss}} to get the per-socket statistics and filter and aggregate 
> accordingly to get a measure of RTT.
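
As a rough sketch of that aggregation step (an illustration only, not the 
isolator code; it assumes the rtt:<avg>/<mdev> field that {{ss -t -i}} prints 
for TCP sockets, and it would have to run inside the container's network 
namespace):

{code}
// Sketch: average the smoothed RTT reported by `ss -t -i` across sockets.
// Illustration only; assumes the "rtt:<avg>/<mdev>" field in ss output.
#include <cstdio>
#include <iostream>
#include <numeric>
#include <regex>
#include <string>
#include <vector>

int main()
{
  FILE* pipe = popen("ss -t -i", "r");
  if (pipe == nullptr) {
    perror("popen");
    return 1;
  }

  const std::regex rttRegex("rtt:([0-9.]+)/([0-9.]+)");
  std::vector<double> rtts;

  char buffer[4096];
  while (fgets(buffer, sizeof(buffer), pipe) != nullptr) {
    std::cmatch match;
    if (std::regex_search(buffer, match, rttRegex)) {
      rtts.push_back(std::stod(match[1].str()));  // smoothed RTT in ms
    }
  }
  pclose(pipe);

  if (!rtts.empty()) {
    double mean =
      std::accumulate(rtts.begin(), rtts.end(), 0.0) / rtts.size();
    std::cout << rtts.size() << " sockets, mean rtt " << mean << " ms"
              << std::endl;
  }
  return 0;
}
{code}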



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1662) Mesos doesn't limit swap

2014-09-18 Thread Chi Hoang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139289#comment-14139289
 ] 

Chi Hoang commented on MESOS-1662:
--

awesome!  thanks!

> Mesos doesn't limit swap
> 
>
> Key: MESOS-1662
> URL: https://issues.apache.org/jira/browse/MESOS-1662
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Affects Versions: 0.19.1
>Reporter: Andrew Forgue
>Assignee: Anton Lindström
> Fix For: 0.20.0
>
>
> When using control groups, mesos will limit memory usage, but if the 
> CONFIG_MEMCG_SWAP config option is enabled swap usage is not limited.
> This means that if a task asked for 1G and allocated 4G, it will fill 3G 
> of swap.  The expected behavior is that the cgroup should have OOMed.  The 
> control group key for limiting both Memory+Swap is 
> memory.memsw.limit_in_bytes (not memory.limit_in_bytes).  It looks like 
> CONFIG_MEMCG_SWAP showed up in Kernel 3.6.
> Mesos should limit swap+memory if possible.  I can't imagine when you'd want 
> to limit memory but not swap, but there may be some situations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1815) Create a guide to becoming a committer

2014-09-18 Thread Dominic Hamon (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139232#comment-14139232
 ] 

Dominic Hamon commented on MESOS-1815:
--

Please review at https://reviews.apache.org/r/25785/


> Create a guide to becoming a committer
> --
>
> Key: MESOS-1815
> URL: https://issues.apache.org/jira/browse/MESOS-1815
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Dominic Hamon
>Assignee: Dominic Hamon
>
> We have a committer's guide, but the process by which one becomes a committer 
> is unclear. We should set some guidelines and a process by which we can grow 
> contributors into committers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-1815) Create a guide to becoming a committer

2014-09-18 Thread Dominic Hamon (JIRA)
Dominic Hamon created MESOS-1815:


 Summary: Create a guide to becoming a committer
 Key: MESOS-1815
 URL: https://issues.apache.org/jira/browse/MESOS-1815
 Project: Mesos
  Issue Type: Documentation
  Components: documentation
Reporter: Dominic Hamon
Assignee: Dominic Hamon


We have a committer's guide, but the process by which one becomes a committer 
is unclear. We should set some guidelines and a process by which we can grow 
contributors into committers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1814) Task attempted to use more offers than requested in example java and python frameworks

2014-09-18 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-1814:
--
  Component/s: test
   Sprint: Mesos Q3 Sprint 5
 Target Version/s: 0.21.0
Affects Version/s: 0.21.0
 Shepherd: Yan Xu
 Story Points: 2
  Summary: Task attempted to use more offers than requested in 
example java and python frameworks  (was: Task attempted to use more offers 
than requested in example framework)

This is a latent bug in both the Java and Python example frameworks. Both 
frameworks launch tasks without checking whether the resources offered to them 
are enough to launch the task.

We are seeing this now because of the recently landed change that offers 
frameworks resources with no memory or no cpu. Before this change, no such 
offers were made, so the frameworks were lucky that every offer they received 
matched their task requirements.

I'll send a patch shortly.
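
Schematically, the missing check looks something like the following (simplified 
stand-in types for illustration, not the actual Mesos Java/Python scheduler API):

{code}
// Schematic sketch of vetting an offer before launching a task on it.
// Simplified stand-in types; not the Mesos scheduler API.
#include <iostream>
#include <string>
#include <vector>

struct Resources { double cpus; double mem; };
struct Offer { std::string id; Resources resources; };

const double TASK_CPUS = 1.0;
const double TASK_MEM = 128.0;  // MB

// Only launch on offers that carry enough cpu and memory; offers with no
// cpu or no memory (now possible) should be declined instead.
bool canLaunch(const Offer& offer)
{
  return offer.resources.cpus >= TASK_CPUS && offer.resources.mem >= TASK_MEM;
}

int main()
{
  std::vector<Offer> offers = {
    {"offer-1", {0.0, 1001.0}},  // cpu-less offer: must be declined
    {"offer-2", {1.0, 1001.0}},
  };

  for (const Offer& offer : offers) {
    std::cout << (canLaunch(offer) ? "launch task on " : "decline ")
              << offer.id << std::endl;
  }
  return 0;
}
{code}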

> Task attempted to use more offers than requested in example java and python 
> frameworks
> --
>
> Key: MESOS-1814
> URL: https://issues.apache.org/jira/browse/MESOS-1814
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.21.0
>Reporter: Vinod Kone
>Assignee: Vinod Kone
>
> {code}
> [ RUN  ] ExamplesTest.JavaFramework
> Using temporary directory '/tmp/ExamplesTest_JavaFramework_2PcFCh'
> Enabling authentication for the framework
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0917 23:14:35.199069 31510 process.cpp:1771] libprocess is initialized on 
> 127.0.1.1:34609 for 8 cpus
> I0917 23:14:35.199794 31510 logging.cpp:177] Logging to STDERR
> I0917 23:14:35.225342 31510 leveldb.cpp:176] Opened db in 22.197149ms
> I0917 23:14:35.231133 31510 leveldb.cpp:183] Compacted db in 5.601897ms
> I0917 23:14:35.231498 31510 leveldb.cpp:198] Created db iterator in 215441ns
> I0917 23:14:35.231608 31510 leveldb.cpp:204] Seeked to beginning of db in 
> 11488ns
> I0917 23:14:35.231722 31510 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 14016ns
> I0917 23:14:35.231917 31510 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0917 23:14:35.233129 31526 recover.cpp:425] Starting replica recovery
> I0917 23:14:35.233614 31526 recover.cpp:451] Replica is in EMPTY status
> I0917 23:14:35.234994 31526 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0917 23:14:35.240116 31519 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0917 23:14:35.240782 31519 recover.cpp:542] Updating replica status to 
> STARTING
> I0917 23:14:35.242846 31524 master.cpp:286] Master 
> 20140917-231435-16842879-34609-31503 (saucy) started on 127.0.1.1:34609
> I0917 23:14:35.243191 31524 master.cpp:332] Master only allowing 
> authenticated frameworks to register
> I0917 23:14:35.243288 31524 master.cpp:339] Master allowing unauthenticated 
> slaves to register
> I0917 23:14:35.243399 31524 credentials.hpp:36] Loading credentials for 
> authentication from '/tmp/ExamplesTest_JavaFramework_2PcFCh/credentials'
> W0917 23:14:35.243588 31524 credentials.hpp:51] Permissions on credentials 
> file '/tmp/ExamplesTest_JavaFramework_2PcFCh/credentials' are too open. It is 
> recommended that your credentials file is NOT accessible by others.
> I0917 23:14:35.243846 31524 master.cpp:366] Authorization enabled
> I0917 23:14:35.244882 31520 hierarchical_allocator_process.hpp:299] 
> Initializing hierarchical allocator process with master : 
> master@127.0.1.1:34609
> I0917 23:14:35.245224 31520 master.cpp:120] No whitelist given. Advertising 
> offers for all slaves
> I0917 23:14:35.246934 31524 master.cpp:1211] The newly elected leader is 
> master@127.0.1.1:34609 with id 20140917-231435-16842879-34609-31503
> I0917 23:14:35.247234 31524 master.cpp:1224] Elected as the leading master!
> I0917 23:14:35.247336 31524 master.cpp:1042] Recovering from registrar
> I0917 23:14:35.247542 31526 registrar.cpp:313] Recovering registrar
> I0917 23:14:35.250555 31510 containerizer.cpp:89] Using isolation: 
> posix/cpu,posix/mem
> I0917 23:14:35.252326 31510 containerizer.cpp:89] Using isolation: 
> posix/cpu,posix/mem
> I0917 23:14:35.252821 31520 slave.cpp:169] Slave started on 1)@127.0.1.1:34609
> I0917 23:14:35.253552 31520 slave.cpp:289] Slave resources: cpus(*):1; 
> mem(*):1001; disk(*):24988; ports(*):[31000-32000]
> I0917 23:14:35.253906 31520 slave.cpp:317] Slave hostname: saucy
> I0917 23:14:35.254004 31520 slave.cpp:318] Slave checkpoint: true
> I0917 23:14:35.254818 31520 state.cpp:33] Recovering state from 
> '/tmp/mesos-w8snRW/0/meta'
> I0917 23:14:35.255106 31519 leveldb.cpp:306] Persist

[jira] [Updated] (MESOS-1662) Mesos doesn't limit swap

2014-09-18 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-1662:
--
Fix Version/s: 0.20.0

> Mesos doesn't limit swap
> 
>
> Key: MESOS-1662
> URL: https://issues.apache.org/jira/browse/MESOS-1662
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Affects Versions: 0.19.1
>Reporter: Andrew Forgue
>Assignee: Anton Lindström
> Fix For: 0.20.0
>
>
> When using control groups, mesos will limit memory usage, but if the 
> CONFIG_MEMCG_SWAP config option is enabled swap usage is not limited.
> This means that if a task asked for 1G and allocated 4G, it will fill 3G 
> of swap.  The expected behavior is that the cgroup should have OOMed.  The 
> control group key for limiting both Memory+Swap is 
> memory.memsw.limit_in_bytes (not memory.limit_in_bytes).  It looks like 
> CONFIG_MEMCG_SWAP showed up in Kernel 3.6.
> Mesos should limit swap+memory if possible.  I can't imagine when you'd want 
> to limit memory but not swap, but there may be some situations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1662) Mesos doesn't limit swap

2014-09-18 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139181#comment-14139181
 ] 

Jie Yu commented on MESOS-1662:
---

[~chi_groupon] This is fixed in 0.20.0

commit cfa168f5136f05d01ebb5ffa3a42a118db14c43e
Author: Anton Lindström 
Date:   Sun Aug 10 16:26:46 2014 -0700

Allowed cgroups mem isolator to limit swap by setting
memory.memsw.limit_in_bytes.

Review: https://reviews.apache.org/r/24316
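
In essence the fix amounts to writing the combined limit alongside the plain 
memory limit; a minimal illustration (not the isolator code, and the cgroup path 
here is an assumption) would be:

{code}
// Illustration of the cgroup knobs involved; not the Mesos isolator code.
// Assumes a cgroup-v1 memory hierarchy at /sys/fs/cgroup/memory and a
// kernel with CONFIG_MEMCG_SWAP (swap accounting) enabled.
#include <fstream>
#include <iostream>
#include <string>

static bool writeControl(const std::string& path, const std::string& value)
{
  std::ofstream file(path);
  return static_cast<bool>(file << value);
}

int main()
{
  const std::string cgroup = "/sys/fs/cgroup/memory/mesos/test";
  const std::string limit = "1073741824";  // 1 GB

  // Set the RAM limit first, then RAM+swap to the same value, so a task
  // that overshoots OOMs instead of spilling into swap.
  if (!writeControl(cgroup + "/memory.limit_in_bytes", limit) ||
      !writeControl(cgroup + "/memory.memsw.limit_in_bytes", limit)) {
    std::cerr << "failed to set memory limits" << std::endl;
    return 1;
  }
  return 0;
}
{code}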

> Mesos doesn't limit swap
> 
>
> Key: MESOS-1662
> URL: https://issues.apache.org/jira/browse/MESOS-1662
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Affects Versions: 0.19.1
>Reporter: Andrew Forgue
>Assignee: Anton Lindström
> Fix For: 0.20.0
>
>
> When using control groups, mesos will limit memory usage, but if the 
> CONFIG_MEMCG_SWAP config option is enabled swap usage is not limited.
> This means that if a task asked for 1G and allocated 4G, it will fill 3G 
> of swap.  The expected behavior is that the cgroup should have OOMed.  The 
> control group key for limiting both Memory+Swap is 
> memory.memsw.limit_in_bytes (not memory.limit_in_bytes).  It looks like 
> CONFIG_MEMCG_SWAP showed up in Kernel 3.6.
> Mesos should limit swap+memory if possible.  I can't imagine when you'd want 
> to limit memory but not swap, but there may be some situations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1662) Mesos doesn't limit swap

2014-09-18 Thread Chi Hoang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139176#comment-14139176
 ] 

Chi Hoang commented on MESOS-1662:
--

Wondering what happened with this fix.  Status says fixed, but it wasn't 
included in 0.20.0.

> Mesos doesn't limit swap
> 
>
> Key: MESOS-1662
> URL: https://issues.apache.org/jira/browse/MESOS-1662
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Affects Versions: 0.19.1
>Reporter: Andrew Forgue
>Assignee: Anton Lindström
>
> When using control groups, mesos will limit memory usage, but if the 
> CONFIG_MEMCG_SWAP config option is enabled swap usage is not limited.
> This means that if a task asked for 1G and allocated 4G, it will fill 3G 
> of swap.  The expected behavior is that the cgroup should have OOMed.  The 
> control group key for limiting both Memory+Swap is 
> memory.memsw.limit_in_bytes (not memory.limit_in_bytes).  It looks like 
> CONFIG_MEMCG_SWAP showed up in Kernel 3.6.
> Mesos should limit swap+memory if possible.  I can't imagine when you'd want 
> to limit memory but not swap, but there may be some situations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1812) Queued tasks are not actually launched in the order they were queued

2014-09-18 Thread Dominic Hamon (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139171#comment-14139171
 ] 

Dominic Hamon commented on MESOS-1812:
--

MESOS-497 doesn't give any reasoning other than "it would be nice", so I would 
also like to hear why this is important.

I'm not saying it isn't; I just want to make sure we're not artificially adding 
constraints to the system.

> Queued tasks are not actually launched in the order they were queued
> 
>
> Key: MESOS-1812
> URL: https://issues.apache.org/jira/browse/MESOS-1812
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Tom Arnfeld
>
> Even though tasks are assigned and queued in the order in which they are 
> launched (e.g. multiple tasks in reply to one offer), timing issues with the 
> futures can sometimes break that ordering, so tasks end up not being 
> launched in order.
> Example trace from a slave... In this example the Task_Tracker_10 task should 
> be launched before slots_Task_Tracker_10.
> {code}
> I0918 02:10:50.371445 17072 slave.cpp:933] Got assigned task Task_Tracker_10 
> for framework 20140916-233111-3171422218-5050-14295-0015
> I0918 02:10:50.372110 17072 slave.cpp:933] Got assigned task 
> slots_Task_Tracker_10 for framework 20140916-233111-3171422218-5050-14295-0015
> I0918 02:10:50.372172 17073 gc.cpp:84] Unscheduling 
> '/mnt/mesos-slave/slaves/20140915-112519-3171422218-5050-5016-6/frameworks/20140916-233111-3171422218-5050-14295-0015'
>  from gc
> I0918 02:10:50.375018 17072 slave.cpp:1043] Launching task 
> slots_Task_Tracker_10 for framework 20140916-233111-3171422218-5050-14295-0015
> I0918 02:10:50.386282 17072 slave.cpp:1153] Queuing task 
> 'slots_Task_Tracker_10' for executor executor_Task_Tracker_10 of framework 
> '20140916-233111-3171422218-5050-14295-0015
> I0918 02:10:50.386312 17070 mesos_containerizer.cpp:537] Starting container 
> '5f507f09-b48e-44ea-b74e-740b0e8bba4d' for executor 
> 'executor_Task_Tracker_10' of framework 
> '20140916-233111-3171422218-5050-14295-0015'
> I0918 02:10:50.388942 17072 slave.cpp:1043] Launching task Task_Tracker_10 
> for framework 20140916-233111-3171422218-5050-14295-0015
> I0918 02:10:50.406277 17070 launcher.cpp:117] Forked child with pid '817' for 
> container '5f507f09-b48e-44ea-b74e-740b0e8bba4d'
> I0918 02:10:50.406563 17072 slave.cpp:1153] Queuing task 'Task_Tracker_10' 
> for executor executor_Task_Tracker_10 of framework 
> '20140916-233111-3171422218-5050-14295-0015
> I0918 02:10:50.408499 17069 mesos_containerizer.cpp:647] Fetching URIs for 
> container '5f507f09-b48e-44ea-b74e-740b0e8bba4d' using command 
> '/usr/local/libexec/mesos/mesos-fetcher'
> I0918 02:11:11.650687 17071 slave.cpp:2873] Current usage 17.34%. Max allowed 
> age: 5.086371210668750days
> I0918 02:11:16.590270 17075 slave.cpp:2355] Monitoring executor 
> 'executor_Task_Tracker_10' of framework 
> '20140916-233111-3171422218-5050-14295-0015' in container 
> '5f507f09-b48e-44ea-b74e-740b0e8bba4d'
> I0918 02:11:17.701015 17070 slave.cpp:1664] Got registration for executor 
> 'executor_Task_Tracker_10' of framework 
> 20140916-233111-3171422218-5050-14295-0015
> I0918 02:11:17.701897 17070 slave.cpp:1783] Flushing queued task 
> slots_Task_Tracker_10 for executor 'executor_Task_Tracker_10' of framework 
> 20140916-233111-3171422218-5050-14295-0015
> I0918 02:11:17.702350 17070 slave.cpp:1783] Flushing queued task 
> Task_Tracker_10 for executor 'executor_Task_Tracker_10' of framework 
> 20140916-233111-3171422218-5050-14295-0015
> I0918 02:11:18.588388 17070 mesos_containerizer.cpp:1112] Executor for 
> container '5f507f09-b48e-44ea-b74e-740b0e8bba4d' has exited
> I0918 02:11:18.588665 17070 mesos_containerizer.cpp:996] Destroying container 
> '5f507f09-b48e-44ea-b74e-740b0e8bba4d'
> I0918 02:11:18.599234 17072 slave.cpp:2413] Executor 
> 'executor_Task_Tracker_10' of framework 
> 20140916-233111-3171422218-5050-14295-0015 has exited with status 1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-1814) Task attempted to use more offers than requested in example framework

2014-09-18 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-1814:
-

Assignee: Vinod Kone

> Task attempted to use more offers than requested in example framework
> -
>
> Key: MESOS-1814
> URL: https://issues.apache.org/jira/browse/MESOS-1814
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Vinod Kone
>
> {code}
> [ RUN  ] ExamplesTest.JavaFramework
> Using temporary directory '/tmp/ExamplesTest_JavaFramework_2PcFCh'
> Enabling authentication for the framework
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0917 23:14:35.199069 31510 process.cpp:1771] libprocess is initialized on 
> 127.0.1.1:34609 for 8 cpus
> I0917 23:14:35.199794 31510 logging.cpp:177] Logging to STDERR
> I0917 23:14:35.225342 31510 leveldb.cpp:176] Opened db in 22.197149ms
> I0917 23:14:35.231133 31510 leveldb.cpp:183] Compacted db in 5.601897ms
> I0917 23:14:35.231498 31510 leveldb.cpp:198] Created db iterator in 215441ns
> I0917 23:14:35.231608 31510 leveldb.cpp:204] Seeked to beginning of db in 
> 11488ns
> I0917 23:14:35.231722 31510 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 14016ns
> I0917 23:14:35.231917 31510 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0917 23:14:35.233129 31526 recover.cpp:425] Starting replica recovery
> I0917 23:14:35.233614 31526 recover.cpp:451] Replica is in EMPTY status
> I0917 23:14:35.234994 31526 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0917 23:14:35.240116 31519 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0917 23:14:35.240782 31519 recover.cpp:542] Updating replica status to 
> STARTING
> I0917 23:14:35.242846 31524 master.cpp:286] Master 
> 20140917-231435-16842879-34609-31503 (saucy) started on 127.0.1.1:34609
> I0917 23:14:35.243191 31524 master.cpp:332] Master only allowing 
> authenticated frameworks to register
> I0917 23:14:35.243288 31524 master.cpp:339] Master allowing unauthenticated 
> slaves to register
> I0917 23:14:35.243399 31524 credentials.hpp:36] Loading credentials for 
> authentication from '/tmp/ExamplesTest_JavaFramework_2PcFCh/credentials'
> W0917 23:14:35.243588 31524 credentials.hpp:51] Permissions on credentials 
> file '/tmp/ExamplesTest_JavaFramework_2PcFCh/credentials' are too open. It is 
> recommended that your credentials file is NOT accessible by others.
> I0917 23:14:35.243846 31524 master.cpp:366] Authorization enabled
> I0917 23:14:35.244882 31520 hierarchical_allocator_process.hpp:299] 
> Initializing hierarchical allocator process with master : 
> master@127.0.1.1:34609
> I0917 23:14:35.245224 31520 master.cpp:120] No whitelist given. Advertising 
> offers for all slaves
> I0917 23:14:35.246934 31524 master.cpp:1211] The newly elected leader is 
> master@127.0.1.1:34609 with id 20140917-231435-16842879-34609-31503
> I0917 23:14:35.247234 31524 master.cpp:1224] Elected as the leading master!
> I0917 23:14:35.247336 31524 master.cpp:1042] Recovering from registrar
> I0917 23:14:35.247542 31526 registrar.cpp:313] Recovering registrar
> I0917 23:14:35.250555 31510 containerizer.cpp:89] Using isolation: 
> posix/cpu,posix/mem
> I0917 23:14:35.252326 31510 containerizer.cpp:89] Using isolation: 
> posix/cpu,posix/mem
> I0917 23:14:35.252821 31520 slave.cpp:169] Slave started on 1)@127.0.1.1:34609
> I0917 23:14:35.253552 31520 slave.cpp:289] Slave resources: cpus(*):1; 
> mem(*):1001; disk(*):24988; ports(*):[31000-32000]
> I0917 23:14:35.253906 31520 slave.cpp:317] Slave hostname: saucy
> I0917 23:14:35.254004 31520 slave.cpp:318] Slave checkpoint: true
> I0917 23:14:35.254818 31520 state.cpp:33] Recovering state from 
> '/tmp/mesos-w8snRW/0/meta'
> I0917 23:14:35.255106 31519 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 13.99622ms
> I0917 23:14:35.255235 31519 replica.cpp:320] Persisted replica status to 
> STARTING
> I0917 23:14:35.255419 31519 recover.cpp:451] Replica is in STARTING status
> I0917 23:14:35.255834 31519 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0917 23:14:35.256000 31519 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0917 23:14:35.256217 31519 recover.cpp:542] Updating replica status to VOTING
> I0917 23:14:35.256641 31520 status_update_manager.cpp:193] Recovering status 
> update manager
> I0917 23:14:35.257064 31520 containerizer.cpp:252] Recovering containerizer
> I0917 23:14:35.257725 31520 slave.cpp:3220] Finished recovery
> I0917 23:14:35.258463 31520 slave.cpp:600] New master detected at 
> master@127.0.1.1:34609
> I0917 23:14:35.258769 31524 status_update_manager.cpp:167] New 

[jira] [Created] (MESOS-1814) Task attempted to use more offers than requested in example framework

2014-09-18 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-1814:
-

 Summary: Task attempted to use more offers than requested in 
example framework
 Key: MESOS-1814
 URL: https://issues.apache.org/jira/browse/MESOS-1814
 Project: Mesos
  Issue Type: Bug
Reporter: Vinod Kone


{code}
[ RUN  ] ExamplesTest.JavaFramework
Using temporary directory '/tmp/ExamplesTest_JavaFramework_2PcFCh'
Enabling authentication for the framework
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0917 23:14:35.199069 31510 process.cpp:1771] libprocess is initialized on 
127.0.1.1:34609 for 8 cpus
I0917 23:14:35.199794 31510 logging.cpp:177] Logging to STDERR
I0917 23:14:35.225342 31510 leveldb.cpp:176] Opened db in 22.197149ms
I0917 23:14:35.231133 31510 leveldb.cpp:183] Compacted db in 5.601897ms
I0917 23:14:35.231498 31510 leveldb.cpp:198] Created db iterator in 215441ns
I0917 23:14:35.231608 31510 leveldb.cpp:204] Seeked to beginning of db in 
11488ns
I0917 23:14:35.231722 31510 leveldb.cpp:273] Iterated through 0 keys in the db 
in 14016ns
I0917 23:14:35.231917 31510 replica.cpp:741] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
I0917 23:14:35.233129 31526 recover.cpp:425] Starting replica recovery
I0917 23:14:35.233614 31526 recover.cpp:451] Replica is in EMPTY status
I0917 23:14:35.234994 31526 replica.cpp:638] Replica in EMPTY status received a 
broadcasted recover request
I0917 23:14:35.240116 31519 recover.cpp:188] Received a recover response from a 
replica in EMPTY status
I0917 23:14:35.240782 31519 recover.cpp:542] Updating replica status to STARTING
I0917 23:14:35.242846 31524 master.cpp:286] Master 
20140917-231435-16842879-34609-31503 (saucy) started on 127.0.1.1:34609
I0917 23:14:35.243191 31524 master.cpp:332] Master only allowing authenticated 
frameworks to register
I0917 23:14:35.243288 31524 master.cpp:339] Master allowing unauthenticated 
slaves to register
I0917 23:14:35.243399 31524 credentials.hpp:36] Loading credentials for 
authentication from '/tmp/ExamplesTest_JavaFramework_2PcFCh/credentials'
W0917 23:14:35.243588 31524 credentials.hpp:51] Permissions on credentials file 
'/tmp/ExamplesTest_JavaFramework_2PcFCh/credentials' are too open. It is 
recommended that your credentials file is NOT accessible by others.
I0917 23:14:35.243846 31524 master.cpp:366] Authorization enabled
I0917 23:14:35.244882 31520 hierarchical_allocator_process.hpp:299] 
Initializing hierarchical allocator process with master : master@127.0.1.1:34609
I0917 23:14:35.245224 31520 master.cpp:120] No whitelist given. Advertising 
offers for all slaves
I0917 23:14:35.246934 31524 master.cpp:1211] The newly elected leader is 
master@127.0.1.1:34609 with id 20140917-231435-16842879-34609-31503
I0917 23:14:35.247234 31524 master.cpp:1224] Elected as the leading master!
I0917 23:14:35.247336 31524 master.cpp:1042] Recovering from registrar
I0917 23:14:35.247542 31526 registrar.cpp:313] Recovering registrar
I0917 23:14:35.250555 31510 containerizer.cpp:89] Using isolation: 
posix/cpu,posix/mem
I0917 23:14:35.252326 31510 containerizer.cpp:89] Using isolation: 
posix/cpu,posix/mem
I0917 23:14:35.252821 31520 slave.cpp:169] Slave started on 1)@127.0.1.1:34609
I0917 23:14:35.253552 31520 slave.cpp:289] Slave resources: cpus(*):1; 
mem(*):1001; disk(*):24988; ports(*):[31000-32000]
I0917 23:14:35.253906 31520 slave.cpp:317] Slave hostname: saucy
I0917 23:14:35.254004 31520 slave.cpp:318] Slave checkpoint: true
I0917 23:14:35.254818 31520 state.cpp:33] Recovering state from 
'/tmp/mesos-w8snRW/0/meta'
I0917 23:14:35.255106 31519 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 13.99622ms
I0917 23:14:35.255235 31519 replica.cpp:320] Persisted replica status to 
STARTING
I0917 23:14:35.255419 31519 recover.cpp:451] Replica is in STARTING status
I0917 23:14:35.255834 31519 replica.cpp:638] Replica in STARTING status 
received a broadcasted recover request
I0917 23:14:35.256000 31519 recover.cpp:188] Received a recover response from a 
replica in STARTING status
I0917 23:14:35.256217 31519 recover.cpp:542] Updating replica status to VOTING
I0917 23:14:35.256641 31520 status_update_manager.cpp:193] Recovering status 
update manager
I0917 23:14:35.257064 31520 containerizer.cpp:252] Recovering containerizer
I0917 23:14:35.257725 31520 slave.cpp:3220] Finished recovery
I0917 23:14:35.258463 31520 slave.cpp:600] New master detected at 
master@127.0.1.1:34609
I0917 23:14:35.258769 31524 status_update_manager.cpp:167] New master detected 
at master@127.0.1.1:34609
I0917 23:14:35.258885 31520 slave.cpp:636] No credentials provided. Attempting 
to register without authentication
I0917 23:14:35.259024 31520 slave.cpp:647] Detecting new master
I0917 23:14:35.259863 31520 slave.cpp:169] Slave started on 2)@127.0.1.1:34609
I0917 23:14:35.260288 31520 slave.cpp:289] Slave resources: cpus(*):1; 
mem(*):1001; disk(*):24988; 

[jira] [Commented] (MESOS-1392) Failure when znode is removed before we can read its contents.

2014-09-18 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139147#comment-14139147
 ] 

Yan Xu commented on MESOS-1392:
---

Thanks [~jaybuff] for reminding me to close this ticket!

> Failure when znode is removed before we can read its contents.
> --
>
> Key: MESOS-1392
> URL: https://issues.apache.org/jira/browse/MESOS-1392
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.19.0
>Reporter: Benjamin Mahler
>Assignee: Yan Xu
> Fix For: 0.21.0
>
>
> Looks like the following can occur when a znode goes away right before we can 
> read its contents:
> {noformat: title=Slave exit}
> I0520 16:33:45.721727 29155 group.cpp:382] Trying to create path 
> '/home/mesos/test/master' in ZooKeeper
> I0520 16:33:48.600837 29155 detector.cpp:134] Detected a new leader: 
> (id='2617')
> I0520 16:33:48.601428 29147 group.cpp:655] Trying to get 
> '/home/mesos/test/master/info_002617' in ZooKeeper
> Failed to detect a master: Failed to get data for ephemeral node 
> '/home/mesos/test/master/info_002617' in ZooKeeper: no node
> Slave Exit Status: 1
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1384) Add support for loadable MesosModule

2014-09-18 Thread Timothy St. Clair (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139097#comment-14139097
 ] 

Timothy St. Clair commented on MESOS-1384:
--

Having a pluggable architecture would enable folks to do the following: 

1. Test PoC ideas in a clean way without impacting mainline.
2. Enable service providers to write custom interfaces that may only apply to 
their workflow.  *This is the big one*
3. Prevent Mesos from accreting too much into its core without having 
well-thought-out boundaries on interfaces and adaptability over time.  By forcing 
that step, it helps to define clear boundaries. 
...  

> Add support for loadable MesosModule
> 
>
> Key: MESOS-1384
> URL: https://issues.apache.org/jira/browse/MESOS-1384
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 0.19.0
>Reporter: Timothy St. Clair
>Assignee: Niklas Quarfot Nielsen
>
> I think we should break this into multiple phases.
> -(1) Let's get the dynamic library loading via a "stout-ified" version of 
> https://github.com/timothysc/tests/blob/master/plugin_modules/DynamicLibrary.h.
>  -
> *DONE*
> (2) Use (1) to instantiate some classes in Mesos (like an Authenticator 
> and/or isolator) from a dynamic library. This will give us some more 
> experience with how we want to name the underlying library symbol, how we 
> want to specify flags for finding the library, what types of validation we 
> want when loading a library.
> *TARGET* 
> (3) After doing (2) for one or two classes in Mesos I think we can formalize 
> the approach in a "mesos-ified" version of 
> https://github.com/timothysc/tests/blob/master/plugin_modules/MesosModule.h.
> *NEXT*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1392) Failure when znode is removed before we can read its contents.

2014-09-18 Thread Jay Buffington (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139098#comment-14139098
 ] 

Jay Buffington commented on MESOS-1392:
---

Looks like this is resolved by this commit: 
https://github.com/apache/mesos/commit/14c605e8ce425ec8c517d8e4f899eb3ddeede56a

> Failure when znode is removed before we can read its contents.
> --
>
> Key: MESOS-1392
> URL: https://issues.apache.org/jira/browse/MESOS-1392
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.19.0
>Reporter: Benjamin Mahler
>Assignee: Yan Xu
>
> Looks like the following can occur when a znode goes away right before we can 
> read its contents:
> {noformat: title=Slave exit}
> I0520 16:33:45.721727 29155 group.cpp:382] Trying to create path 
> '/home/mesos/test/master' in ZooKeeper
> I0520 16:33:48.600837 29155 detector.cpp:134] Detected a new leader: 
> (id='2617')
> I0520 16:33:48.601428 29147 group.cpp:655] Trying to get 
> '/home/mesos/test/master/info_002617' in ZooKeeper
> Failed to detect a master: Failed to get data for ephemeral node 
> '/home/mesos/test/master/info_002617' in ZooKeeper: no node
> Slave Exit Status: 1
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1806) Substituting etcd or ReplicatedLog for Zookeeper

2014-09-18 Thread Timothy St. Clair (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139077#comment-14139077
 ] 

Timothy St. Clair commented on MESOS-1806:
--

[~tnachen] got a branch?  
I'm game to assist, and I'm sure the folks on your end are looking to resolve 
the delta with kube. 


> Substituting etcd or ReplicatedLog for Zookeeper
> 
>
> Key: MESOS-1806
> URL: https://issues.apache.org/jira/browse/MESOS-1806
> Project: Mesos
>  Issue Type: Task
>Reporter: Ed Ropple
>Priority: Minor
>
>eropple: Could you also file a new JIRA for Mesos to drop ZK 
> in favor of etcd or ReplicatedLog? Would love to get some momentum going on 
> that one.
> --
> Consider it filed. =)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1195) systemd.slice + cgroup enablement fails in multiple ways.

2014-09-18 Thread Timothy St. Clair (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy St. Clair updated MESOS-1195:
-
Shepherd: Jie Yu  (was: Vinod Kone)

> systemd.slice + cgroup enablement fails in multiple ways. 
> --
>
> Key: MESOS-1195
> URL: https://issues.apache.org/jira/browse/MESOS-1195
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 0.18.0
>Reporter: Timothy St. Clair
>Assignee: Timothy St. Clair
>
> When attempting to configure Mesos to use systemd slices on a 'rawhide/f21' 
> machine, it fails to create the isolator: 
> I0407 12:39:28.035354 14916 containerizer.cpp:180] Using isolation: 
> cgroups/cpu,cgroups/mem
> Failed to create a containerizer: Could not create isolator cgroups/cpu: 
> Failed to create isolator: The cpu subsystem is co-mounted at 
> /sys/fs/cgroup/cpu with other subsytems
> -- details --
> /sys/fs/cgroup
> total 0
> drwxr-xr-x. 12 root root 280 Mar 18 08:47 .
> drwxr-xr-x.  6 root root   0 Mar 18 08:47 ..
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 blkio
> lrwxrwxrwx.  1 root root  11 Mar 18 08:47 cpu -> cpu,cpuacct
> lrwxrwxrwx.  1 root root  11 Mar 18 08:47 cpuacct -> cpu,cpuacct
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 cpu,cpuacct
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 cpuset
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 devices
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 freezer
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 hugetlb
> drwxr-xr-x.  3 root root   0 Apr  3 11:26 memory
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 net_cls
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 perf_event
> drwxr-xr-x.  4 root root   0 Mar 18 08:47 systemd
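
For reference, co-mounted subsystems can be spotted by scanning /proc/mounts; a 
small sketch of that check (illustration only, not the Mesos cgroups code):

{code}
// Sketch: detect which cgroup subsystems are co-mounted, by reading
// /proc/mounts. Not the Mesos cgroups code; for illustration only.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main()
{
  std::ifstream mounts("/proc/mounts");
  std::string line;

  while (std::getline(mounts, line)) {
    std::istringstream in(line);
    std::string device, dir, type, options;
    in >> device >> dir >> type >> options;

    if (type != "cgroup") {
      continue;
    }

    // Count how many known subsystems appear in the mount options;
    // "cpu,cpuacct" mounted together is what trips the isolator.
    const std::vector<std::string> subsystems =
      {"cpu", "cpuacct", "memory", "freezer", "blkio"};

    int count = 0;
    for (const std::string& subsystem : subsystems) {
      if (("," + options + ",").find("," + subsystem + ",")
          != std::string::npos) {
        count++;
      }
    }

    if (count > 1) {
      std::cout << dir << " co-mounts " << count
                << " subsystems (" << options << ")" << std::endl;
    }
  }
  return 0;
}
{code}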



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1812) Queued tasks are not actually launched in the order they were queued

2014-09-18 Thread Tom Arnfeld (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138752#comment-14138752
 ] 

Tom Arnfeld commented on MESOS-1812:


[~vinodkone] seemed to think they should already be guaranteed, so I guess 
there has been some discussion around this previously. Also, MESOS-497 was 
implemented quite early on.

I don't see any reason why they can't be guaranteed, and the cause of this is a 
timing issue on the slave itself.
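
One generic way to make the ordering hold is to release launches strictly by 
queue position, regardless of which task's asynchronous setup finishes first; a 
minimal sketch (illustration only, not the slave code):

{code}
// Generic illustration (not Mesos code): launches are released strictly in
// queue order even if asynchronous per-task setup completes out of order.
#include <iostream>
#include <map>
#include <string>

struct OrderedLauncher
{
  int nextToLaunch = 0;              // next queue position to release
  std::map<int, std::string> ready;  // setup finished, awaiting release

  void onSetupComplete(int position, const std::string& task)
  {
    ready[position] = task;
    // Release every task whose predecessors have already launched.
    while (ready.count(nextToLaunch) > 0) {
      std::cout << "launch " << ready[nextToLaunch] << std::endl;
      ready.erase(nextToLaunch);
      ++nextToLaunch;
    }
  }
};

int main()
{
  OrderedLauncher launcher;

  // Setup for position 1 finishes first, as in the trace above, but
  // nothing launches until position 0 is also ready.
  launcher.onSetupComplete(1, "slots_Task_Tracker_10");
  launcher.onSetupComplete(0, "Task_Tracker_10");
  return 0;
}
{code}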

> Queued tasks are not actually launched in the order they were queued
> 
>
> Key: MESOS-1812
> URL: https://issues.apache.org/jira/browse/MESOS-1812
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Tom Arnfeld
>
> Even though tasks are assigned and queued in the order in which they are 
> launched (e.g. multiple tasks in reply to one offer), timing issues with the 
> futures can sometimes break that ordering, so tasks end up not being 
> launched in order.
> Example trace from a slave... In this example the Task_Tracker_10 task should 
> be launched before slots_Task_Tracker_10.
> {code}
> I0918 02:10:50.371445 17072 slave.cpp:933] Got assigned task Task_Tracker_10 
> for framework 20140916-233111-3171422218-5050-14295-0015
> I0918 02:10:50.372110 17072 slave.cpp:933] Got assigned task 
> slots_Task_Tracker_10 for framework 20140916-233111-3171422218-5050-14295-0015
> I0918 02:10:50.372172 17073 gc.cpp:84] Unscheduling 
> '/mnt/mesos-slave/slaves/20140915-112519-3171422218-5050-5016-6/frameworks/20140916-233111-3171422218-5050-14295-0015'
>  from gc
> I0918 02:10:50.375018 17072 slave.cpp:1043] Launching task 
> slots_Task_Tracker_10 for framework 20140916-233111-3171422218-5050-14295-0015
> I0918 02:10:50.386282 17072 slave.cpp:1153] Queuing task 
> 'slots_Task_Tracker_10' for executor executor_Task_Tracker_10 of framework 
> '20140916-233111-3171422218-5050-14295-0015
> I0918 02:10:50.386312 17070 mesos_containerizer.cpp:537] Starting container 
> '5f507f09-b48e-44ea-b74e-740b0e8bba4d' for executor 
> 'executor_Task_Tracker_10' of framework 
> '20140916-233111-3171422218-5050-14295-0015'
> I0918 02:10:50.388942 17072 slave.cpp:1043] Launching task Task_Tracker_10 
> for framework 20140916-233111-3171422218-5050-14295-0015
> I0918 02:10:50.406277 17070 launcher.cpp:117] Forked child with pid '817' for 
> container '5f507f09-b48e-44ea-b74e-740b0e8bba4d'
> I0918 02:10:50.406563 17072 slave.cpp:1153] Queuing task 'Task_Tracker_10' 
> for executor executor_Task_Tracker_10 of framework 
> '20140916-233111-3171422218-5050-14295-0015
> I0918 02:10:50.408499 17069 mesos_containerizer.cpp:647] Fetching URIs for 
> container '5f507f09-b48e-44ea-b74e-740b0e8bba4d' using command 
> '/usr/local/libexec/mesos/mesos-fetcher'
> I0918 02:11:11.650687 17071 slave.cpp:2873] Current usage 17.34%. Max allowed 
> age: 5.086371210668750days
> I0918 02:11:16.590270 17075 slave.cpp:2355] Monitoring executor 
> 'executor_Task_Tracker_10' of framework 
> '20140916-233111-3171422218-5050-14295-0015' in container 
> '5f507f09-b48e-44ea-b74e-740b0e8bba4d'
> I0918 02:11:17.701015 17070 slave.cpp:1664] Got registration for executor 
> 'executor_Task_Tracker_10' of framework 
> 20140916-233111-3171422218-5050-14295-0015
> I0918 02:11:17.701897 17070 slave.cpp:1783] Flushing queued task 
> slots_Task_Tracker_10 for executor 'executor_Task_Tracker_10' of framework 
> 20140916-233111-3171422218-5050-14295-0015
> I0918 02:11:17.702350 17070 slave.cpp:1783] Flushing queued task 
> Task_Tracker_10 for executor 'executor_Task_Tracker_10' of framework 
> 20140916-233111-3171422218-5050-14295-0015
> I0918 02:11:18.588388 17070 mesos_containerizer.cpp:1112] Executor for 
> container '5f507f09-b48e-44ea-b74e-740b0e8bba4d' has exited
> I0918 02:11:18.588665 17070 mesos_containerizer.cpp:996] Destroying container 
> '5f507f09-b48e-44ea-b74e-740b0e8bba4d'
> I0918 02:11:18.599234 17072 slave.cpp:2413] Executor 
> 'executor_Task_Tracker_10' of framework 
> 20140916-233111-3171422218-5050-14295-0015 has exited with status 1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1812) Queued tasks are not actually launched in the order they were queued

2014-09-18 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138738#comment-14138738
 ] 

Alexander Rukletsov commented on MESOS-1812:


Do we (and should we?) guarantee the order is preserved?

> Queued tasks are not actually launched in the order they were queued
> 
>
> Key: MESOS-1812
> URL: https://issues.apache.org/jira/browse/MESOS-1812
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Tom Arnfeld
>
> Even though tasks are assigned and queued in the order in which they are 
> launched (e.g. multiple tasks in reply to one offer), timing issues with the 
> futures can sometimes break that ordering, so tasks end up not being 
> launched in order.
> Example trace from a slave... In this example the Task_Tracker_10 task should 
> be launched before slots_Task_Tracker_10.
> {code}
> I0918 02:10:50.371445 17072 slave.cpp:933] Got assigned task Task_Tracker_10 
> for framework 20140916-233111-3171422218-5050-14295-0015
> I0918 02:10:50.372110 17072 slave.cpp:933] Got assigned task 
> slots_Task_Tracker_10 for framework 20140916-233111-3171422218-5050-14295-0015
> I0918 02:10:50.372172 17073 gc.cpp:84] Unscheduling 
> '/mnt/mesos-slave/slaves/20140915-112519-3171422218-5050-5016-6/frameworks/20140916-233111-3171422218-5050-14295-0015'
>  from gc
> I0918 02:10:50.375018 17072 slave.cpp:1043] Launching task 
> slots_Task_Tracker_10 for framework 20140916-233111-3171422218-5050-14295-0015
> I0918 02:10:50.386282 17072 slave.cpp:1153] Queuing task 
> 'slots_Task_Tracker_10' for executor executor_Task_Tracker_10 of framework 
> '20140916-233111-3171422218-5050-14295-0015
> I0918 02:10:50.386312 17070 mesos_containerizer.cpp:537] Starting container 
> '5f507f09-b48e-44ea-b74e-740b0e8bba4d' for executor 
> 'executor_Task_Tracker_10' of framework 
> '20140916-233111-3171422218-5050-14295-0015'
> I0918 02:10:50.388942 17072 slave.cpp:1043] Launching task Task_Tracker_10 
> for framework 20140916-233111-3171422218-5050-14295-0015
> I0918 02:10:50.406277 17070 launcher.cpp:117] Forked child with pid '817' for 
> container '5f507f09-b48e-44ea-b74e-740b0e8bba4d'
> I0918 02:10:50.406563 17072 slave.cpp:1153] Queuing task 'Task_Tracker_10' 
> for executor executor_Task_Tracker_10 of framework 
> '20140916-233111-3171422218-5050-14295-0015
> I0918 02:10:50.408499 17069 mesos_containerizer.cpp:647] Fetching URIs for 
> container '5f507f09-b48e-44ea-b74e-740b0e8bba4d' using command 
> '/usr/local/libexec/mesos/mesos-fetcher'
> I0918 02:11:11.650687 17071 slave.cpp:2873] Current usage 17.34%. Max allowed 
> age: 5.086371210668750days
> I0918 02:11:16.590270 17075 slave.cpp:2355] Monitoring executor 
> 'executor_Task_Tracker_10' of framework 
> '20140916-233111-3171422218-5050-14295-0015' in container 
> '5f507f09-b48e-44ea-b74e-740b0e8bba4d'
> I0918 02:11:17.701015 17070 slave.cpp:1664] Got registration for executor 
> 'executor_Task_Tracker_10' of framework 
> 20140916-233111-3171422218-5050-14295-0015
> I0918 02:11:17.701897 17070 slave.cpp:1783] Flushing queued task 
> slots_Task_Tracker_10 for executor 'executor_Task_Tracker_10' of framework 
> 20140916-233111-3171422218-5050-14295-0015
> I0918 02:11:17.702350 17070 slave.cpp:1783] Flushing queued task 
> Task_Tracker_10 for executor 'executor_Task_Tracker_10' of framework 
> 20140916-233111-3171422218-5050-14295-0015
> I0918 02:11:18.588388 17070 mesos_containerizer.cpp:1112] Executor for 
> container '5f507f09-b48e-44ea-b74e-740b0e8bba4d' has exited
> I0918 02:11:18.588665 17070 mesos_containerizer.cpp:996] Destroying container 
> '5f507f09-b48e-44ea-b74e-740b0e8bba4d'
> I0918 02:11:18.599234 17072 slave.cpp:2413] Executor 
> 'executor_Task_Tracker_10' of framework 
> 20140916-233111-3171422218-5050-14295-0015 has exited with status 1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)