[ 
https://issues.apache.org/jira/browse/MESOS-3595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van Remoortere updated MESOS-3595:
----------------------------------------
    Sprint: Mesosphere Sprint 24

> Framework process hangs after master failover when number of frameworks > 
> libprocess thread pool size
> --------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-3595
>                 URL: https://issues.apache.org/jira/browse/MESOS-3595
>             Project: Mesos
>          Issue Type: Bug
>          Components: scheduler driver
>    Affects Versions: 0.24.1
>            Reporter: Mandeep Chadha
>            Assignee: Mandeep Chadha
>              Labels: mesosphere
>
> When running multiple framework instances per process, if the number of 
> frameworks created exceeds the number of libprocess threads, then during 
> master failover the ZooKeeper updates can cause a deadlock. For example, on 
> a machine with 24 CPUs, if the framework instance count exceeds 24 (per 
> process), then when the master fails over, all the libprocess threads block 
> while updating the cache (GroupProcess), leading to deadlock. Below is the 
> stack trace of one of the libprocess threads:
> {code}
> Thread 101 (Thread 0x7f42821f1700 (LWP 5974)):
> #0  0x000000314100b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x00007f42870d1637 in Gate::arrive(long) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #2  0x00007f42870be87c in process::ProcessManager::wait(process::UPID const&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #3  0x00007f42870c25f7 in process::wait(process::UPID const&, Duration const&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #4  0x00007f428708e294 in process::Latch::await(Duration const&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #5  0x00007f4286b67dea in process::Future<int>::await(Duration const&) const () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #6  0x00007f4286b5a0df in process::Future<int>::get() const () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #7  0x00007f4286ff0508 in ZooKeeper::getChildren(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #8  0x00007f4286cb394e in zookeeper::GroupProcess::cache() () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #9  0x00007f4286cb1e63 in zookeeper::GroupProcess::updated(long, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #10 0x00007f4286ce027a in std::tr1::_Mem_fn<void 
> (zookeeper::GroupProcess::*)(long, std::basic_string<char, 
> std::char_traits<char>, std::allocator<char> > const&)>::operator()(zo
> okeeper::GroupProcess*, long, std::basic_string<char, std::char_traits<char>, 
> std::allocator<char> > const&) const () from 
> /Users/mchadha/venv/lib/python2.7/site-packages/mesos.n
> ative-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #11 0x00007f4286ce0067 in std::tr1::result_of<std::tr1::_Mem_fn<void 
> (zookeeper::GroupProcess::*)(long, std::basic_string<char, 
> std::char_traits<char>, std::allocator<char> > con
> st&)> ()(std::tr1::result_of<std::tr1::_Mu<std::tr1::_Placeholder<1>, false, 
> true> ()(std::tr1::_Placeholder<1>, 
> std::tr1::tuple<zookeeper::GroupProcess*&>)>::type, std::tr1::res
> ult_of<std::tr1::_Mu<long, false, false> ()(long, 
> std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> 
> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<zookeeper::GroupProcess*&>))
> >::type, std::tr1::result_of<std::tr1::_Mu<std::basic_string<char, 
> std::char_traits<char>, std::allocator<char> >, false, false> 
> ()(std::basic_string<char, std::char_traits<char>
> , std::allocator<char> >, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, 
> true> ()(std::tr1::_Placeholder<1>, 
> std::tr1::tuple<zookeeper::GroupProcess*&>))>::type)>::type std::tr1
> ::_Bind<std::tr1::_Mem_fn<void (zookeeper::GroupProcess::*)(long, 
> std::basic_string<char, std::char_traits<char>, std::allocator<char> > 
> const&)> ()(std::tr1::_Placeholder<1>, lo
> ng, std::basic_string<char, std::char_traits<char>, std::allocator<char> 
> >)>::__call<zookeeper::GroupProcess*&, 0, 1, 
> 2>(std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ( c
> onst&)(std::tr1::_Placeholder<1>, 
> std::tr1::tuple<zookeeper::GroupProcess*&>), std::tr1::_Index_tuple<0, 1, 2>) 
> () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.nati
> ve-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #12 0x00007f4286cdfd16 in std::tr1::result_of<std::tr1::_Mem_fn<void 
> (zookeeper::GroupProcess::*)(long, std::basic_string<char, 
> std::char_traits<char>, std::allocator<char> > con
> st&)> ()(std::tr1::result_of<std::tr1::_Mu<std::tr1::_Placeholder<1>, false, 
> true> ()(std::tr1::_Placeholder<1>, 
> std::tr1::tuple<zookeeper::GroupProcess*>)>::type, std::tr1::resu
> lt_of<std::tr1::_Mu<long, false, false> ()(long, 
> std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> 
> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<zookeeper::GroupProcess*>))>:
> :type, std::tr1::result_of<std::tr1::_Mu<std::basic_string<char, 
> std::char_traits<char>, std::allocator<char> >, false, false> 
> ()(std::basic_string<char, std::char_traits<char>,
> std::allocator<char> >, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> 
> ()(std::tr1::_Placeholder<1>, 
> std::tr1::tuple<zookeeper::GroupProcess*>))>::type)>::type std::tr1::_
> Bind<std::tr1::_Mem_fn<void (zookeeper::GroupProcess::*)(long, 
> std::basic_string<char, std::char_traits<char>, std::allocator<char> > 
> const&)> ()(std::tr1::_Placeholder<1>, long,
>  std::basic_string<char, std::char_traits<char>, std::allocator<char> 
> >)>::operator()<zookeeper::GroupProcess*>(zookeeper::GroupProcess*&) () from 
> /Users/mchadha/venv/lib/python2
> .7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #13 0x00007f4286cdf8be in std::tr1::_Function_handler<void 
> ()(zookeeper::GroupProcess*), std::tr1::_Bind<std::tr1::_Mem_fn<void 
> (zookeeper::GroupProcess::*)(long, std::basic_stri
> ng<char, std::char_traits<char>, std::allocator<char> > const&)> 
> ()(std::tr1::_Placeholder<1>, long, std::basic_string<char, 
> std::char_traits<char>, std::allocator<char> >)> >::_
> M_invoke(std::tr1::_Any_data const&, zookeeper::GroupProcess*) () from 
> /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/
> _mesos.so
> #14 0x00007f4286cc2394 in std::tr1::function<void 
> ()(zookeeper::GroupProcess*)>::operator()(zookeeper::GroupProcess*) const () 
> from /Users/mchadha/venv/lib/python2.7/site-package
> s/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #15 0x00007f4286cbc3a2 in void 
> process::internal::vdispatcher<zookeeper::GroupProcess>(process::ProcessBase*,
>  std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProc
> ess*)> >) () from 
> /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #16 0x00007f4286ccdca5 in std::tr1::result_of<void 
> (*()(std::tr1::result_of<std::tr1::_Mu<std::tr1::_Placeholder<1>, false, 
> true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<pr
> ocess::ProcessBase*&>)>::type, 
> std::tr1::result_of<std::tr1::_Mu<std::tr1::shared_ptr<std::tr1::function<void
>  ()(zookeeper::GroupProcess*)> >, false, false> ()(std::tr1::shared_p
> tr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >, 
> std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> 
> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBa
> se*&>))>::type))(process::ProcessBase*, 
> std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> 
> >)>::type std::tr1::_Bind<void (*()(std::tr1::_Placeholder<1>,
> std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> 
> >))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void 
> ()(zookeeper::GroupProcess*)> >
> )>::__call<process::ProcessBase*&, 0, 
> 1>(std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ( 
> const&)(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*&>), 
> std:
> :tr1::_Index_tuple<0, 1>) () from 
> /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #17 0x00007f4286cc7a5a in std::tr1::result_of<void 
> (*()(std::tr1::result_of<std::tr1::_Mu<std::tr1::_Placeholder<1>, false, 
> true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<pr
> ocess::ProcessBase*>)>::type, 
> std::tr1::result_of<std::tr1::_Mu<std::tr1::shared_ptr<std::tr1::function<void
>  ()(zookeeper::GroupProcess*)> >, false, false> ()(std::tr1::shared_pt
> r<std::tr1::function<void ()(zookeeper::GroupProcess*)> >, 
> std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> 
> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBas
> e*>))>::type))(process::ProcessBase*, 
> std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> 
> >)>::type std::tr1::_Bind<void (*()(std::tr1::_Placeholder<1>, st
> d::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> 
> >))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void 
> ()(zookeeper::GroupProcess*)> >)>
> ::operator()<process::ProcessBase*>(process::ProcessBase*&) () from 
> /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_me
> sos.so
> #18 0x00007f4286cc2480 in std::tr1::_Function_handler<void 
> ()(process::ProcessBase*), std::tr1::_Bind<void 
> (*()(std::tr1::_Placeholder<1>, std::tr1::shared_ptr<std::tr1::function
> <void ()(zookeeper::GroupProcess*)> >))(process::ProcessBase*, 
> std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> 
> >)> >::_M_invoke(std::tr1::_Any_data con
> st&, process::ProcessBase*) () from 
> /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #19 0x00007f42870db546 in std::tr1::function<void 
> ()(process::ProcessBase*)>::operator()(process::ProcessBase*) const () from 
> /Users/mchadha/venv/lib/python2.7/site-packages/meso
> s.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #20 0x00007f42870c1013 in process::ProcessBase::visit(process::DispatchEvent 
> const&) () from 
> /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x8
> 6_64.egg/mesos/native/_mesos.so
> #21 0x00007f42870c5582 in 
> process::DispatchEvent::visit(process::EventVisitor*) const () from 
> /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x
> 86_64.egg/mesos/native/_mesos.so
> #22 0x00007f428666680e in process::ProcessBase::serve(process::Event const&) 
> () from 
> /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg
> /mesos/native/_mesos.so
> #23 0x00007f42870bd88f in 
> process::ProcessManager::resume(process::ProcessBase*) () from 
> /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64
> .egg/mesos/native/_mesos.so
> #24 0x00007f42870b1cb9 in process::schedule(void*) () from 
> /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #25 0x00000031410079d1 in start_thread () from /lib64/libpthread.so.0
> #26 0x00000031408e88fd in clone () from /lib64/libc.so.6
> {code}
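
The trace above shows a worker thread parked in Future<int>::get() under ZooKeeper::getChildren, called from GroupProcess::cache(). The pattern can be sketched in Python, with a plain ThreadPoolExecutor standing in for the libprocess worker pool. This is a toy model, not Mesos code: POOL_SIZE, count_stalled_updates, update_cache and get_children are illustrative names, and the timeout exists only so the demo terminates where the real process would hang. The threshold at which this model stalls is its own and may not match libprocess exactly.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError

POOL_SIZE = 4  # stand-in for the libprocess worker thread count


def count_stalled_updates(num_frameworks, wait_seconds=0.5):
    """Return how many simulated cache updates timed out waiting for a worker."""
    failover = threading.Event()

    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:

        def get_children():
            # Stand-in for the ZooKeeper call; it needs a free pool thread to run.
            return ["master"]

        def update_cache():
            failover.wait()  # every framework reacts to the same failover
            reply = pool.submit(get_children)
            try:
                reply.result(timeout=wait_seconds)  # blocks this worker thread
                return False
            except TimeoutError:
                reply.cancel()  # give up so the demo terminates
                return True

        updates = [pool.submit(update_cache) for _ in range(num_frameworks)]
        failover.set()  # simulate the master failover notification
        return sum(update.result() for update in updates)


if __name__ == "__main__":
    print("below pool size:", count_stalled_updates(POOL_SIZE - 1))
    print("at pool size:   ", count_stalled_updates(POOL_SIZE))
```

As long as one worker stays free, the nested calls complete. Once blocked callers occupy every worker, the work they are waiting for can never be scheduled.
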
> Solution: 
>  Create one master detector per URL instead of one per framework.
> I will send a review request. 
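
One way to picture the proposed fix is a registry that hands the same detector to every framework that uses the same ZooKeeper URL. Presumably only one GroupProcess per URL would then block a worker during failover, instead of one per framework. The Python sketch below is an assumption about the shape of such a change, not the actual patch; MasterDetector and DetectorRegistry are hypothetical stand-ins, not Mesos classes.

```python
import threading


class MasterDetector:
    """Hypothetical stand-in for a ZooKeeper-backed master detector."""

    def __init__(self, url):
        self.url = url


class DetectorRegistry:
    """Hands out one shared detector per URL (illustrative, not Mesos API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._detectors = {}

    def get(self, url):
        # Create the detector on first use; later callers share it.
        with self._lock:
            detector = self._detectors.get(url)
            if detector is None:
                detector = MasterDetector(url)
                self._detectors[url] = detector
            return detector


if __name__ == "__main__":
    registry = DetectorRegistry()
    first = registry.get("zk://host1:2181/mesos")
    second = registry.get("zk://host1:2181/mesos")
    print(first is second)
```

With this shape, the number of detectors grows with the number of distinct URLs rather than with the number of frameworks in the process.
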



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
