[ https://issues.apache.org/jira/browse/MESOS-3595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van Remoortere updated MESOS-3595:
----------------------------------------
    Sprint: Mesosphere Sprint 24

> Framework process hangs after master failover when number frameworks > libprocess thread pool size
> --------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-3595
>                 URL: https://issues.apache.org/jira/browse/MESOS-3595
>             Project: Mesos
>          Issue Type: Bug
>          Components: scheduler driver
>    Affects Versions: 0.24.1
>            Reporter: Mandeep Chadha
>            Assignee: Mandeep Chadha
>              Labels: mesosphere
>
> When running multiple framework instances per process, if the number of
> frameworks created exceeds the number of libprocess threads, then the
> ZooKeeper updates during master failover can cause a deadlock. For example,
> on a machine with 24 CPUs, if the framework instance count exceeds 24 (per
> process), then when the master fails over all the libprocess threads block
> updating the cache (GroupProcess), leading to deadlock. Below is the stack
> trace of one of the libprocess threads:
> {code}
> Thread 101 (Thread 0x7f42821f1700 (LWP 5974)):
> #0 0x000000314100b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1 0x00007f42870d1637 in Gate::arrive(long) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #2 0x00007f42870be87c in process::ProcessManager::wait(process::UPID const&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #3 0x00007f42870c25f7 in process::wait(process::UPID const&, Duration const&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #4 0x00007f428708e294 in process::Latch::await(Duration const&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #5 0x00007f4286b67dea in process::Future<int>::await(Duration const&) const () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #6 0x00007f4286b5a0df in process::Future<int>::get() const () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #7 0x00007f4286ff0508 in ZooKeeper::getChildren(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #8 0x00007f4286cb394e in zookeeper::GroupProcess::cache() () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #9 0x00007f4286cb1e63 in zookeeper::GroupProcess::updated(long, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #10 0x00007f4286ce027a in std::tr1::_Mem_fn<void (zookeeper::GroupProcess::*)(long, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>::operator()(zookeeper::GroupProcess*, long, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #11 0x00007f4286ce0067 in std::tr1::result_of<std::tr1::_Mem_fn<void (zookeeper::GroupProcess::*)(long, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> ()(std::tr1::result_of<std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<zookeeper::GroupProcess*&>)>::type, std::tr1::result_of<std::tr1::_Mu<long, false, false> ()(long, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<zookeeper::GroupProcess*&>))>::type, std::tr1::result_of<std::tr1::_Mu<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, false, false> ()(std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<zookeeper::GroupProcess*&>))>::type)>::type std::tr1::_Bind<std::tr1::_Mem_fn<void (zookeeper::GroupProcess::*)(long, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> ()(std::tr1::_Placeholder<1>, long, std::basic_string<char, std::char_traits<char>, std::allocator<char> >)>::__call<zookeeper::GroupProcess*&, 0, 1, 2>(std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ( const&)(std::tr1::_Placeholder<1>, std::tr1::tuple<zookeeper::GroupProcess*&>), std::tr1::_Index_tuple<0, 1, 2>) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #12 0x00007f4286cdfd16 in std::tr1::result_of<std::tr1::_Mem_fn<void (zookeeper::GroupProcess::*)(long, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> ()(std::tr1::result_of<std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<zookeeper::GroupProcess*>)>::type, std::tr1::result_of<std::tr1::_Mu<long, false, false> ()(long, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<zookeeper::GroupProcess*>))>::type, std::tr1::result_of<std::tr1::_Mu<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, false, false> ()(std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<zookeeper::GroupProcess*>))>::type)>::type std::tr1::_Bind<std::tr1::_Mem_fn<void (zookeeper::GroupProcess::*)(long, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> ()(std::tr1::_Placeholder<1>, long, std::basic_string<char, std::char_traits<char>, std::allocator<char> >)>::operator()<zookeeper::GroupProcess*>(zookeeper::GroupProcess*&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #13 0x00007f4286cdf8be in std::tr1::_Function_handler<void ()(zookeeper::GroupProcess*), std::tr1::_Bind<std::tr1::_Mem_fn<void (zookeeper::GroupProcess::*)(long, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> ()(std::tr1::_Placeholder<1>, long, std::basic_string<char, std::char_traits<char>, std::allocator<char> >)> >::_M_invoke(std::tr1::_Any_data const&, zookeeper::GroupProcess*) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #14 0x00007f4286cc2394 in std::tr1::function<void ()(zookeeper::GroupProcess*)>::operator()(zookeeper::GroupProcess*) const () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #15 0x00007f4286cbc3a2 in void process::internal::vdispatcher<zookeeper::GroupProcess>(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #16 0x00007f4286ccdca5 in std::tr1::result_of<void (*()(std::tr1::result_of<std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*&>)>::type, std::tr1::result_of<std::tr1::_Mu<std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >, false, false> ()(std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*&>))>::type))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >)>::type std::tr1::_Bind<void (*()(std::tr1::_Placeholder<1>, std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >)>::__call<process::ProcessBase*&, 0, 1>(std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ( const&)(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*&>), std::tr1::_Index_tuple<0, 1>) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #17 0x00007f4286cc7a5a in std::tr1::result_of<void (*()(std::tr1::result_of<std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*>)>::type, std::tr1::result_of<std::tr1::_Mu<std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >, false, false> ()(std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*>))>::type))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >)>::type std::tr1::_Bind<void (*()(std::tr1::_Placeholder<1>, std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >)>::operator()<process::ProcessBase*>(process::ProcessBase*&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #18 0x00007f4286cc2480 in std::tr1::_Function_handler<void ()(process::ProcessBase*), std::tr1::_Bind<void (*()(std::tr1::_Placeholder<1>, std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(zookeeper::GroupProcess*)> >)> >::_M_invoke(std::tr1::_Any_data const&, process::ProcessBase*) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #19 0x00007f42870db546 in std::tr1::function<void ()(process::ProcessBase*)>::operator()(process::ProcessBase*) const () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #20 0x00007f42870c1013 in process::ProcessBase::visit(process::DispatchEvent const&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #21 0x00007f42870c5582 in process::DispatchEvent::visit(process::EventVisitor*) const () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #22 0x00007f428666680e in process::ProcessBase::serve(process::Event const&) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #23 0x00007f42870bd88f in process::ProcessManager::resume(process::ProcessBase*) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #24 0x00007f42870b1cb9 in process::schedule(void*) () from /Users/mchadha/venv/lib/python2.7/site-packages/mesos.native-0.22.1003-py2.7-linux-x86_64.egg/mesos/native/_mesos.so
> #25 0x00000031410079d1 in start_thread () from /lib64/libpthread.so.0
> #26 0x00000031408e88fd in clone () from /lib64/libc.so.6
> {code}
> Solution:
> Create a master detector per URL instead of per framework.
> Will send the review request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)