Adar Dembo has submitted this change and it was merged.

Change subject: catalog_manager: avoid race in InitSysCatalogAsync() and GetTabletPeer()
......................................................................

catalog_manager: avoid race in InitSysCatalogAsync() and GetTabletPeer()

Commit 2525ad0 took a stab at this, but it doesn't handle the case where
InitSysCatalogAsync() fails and leaves behind sys_catalog_ without a
functional tablet peer, as in the new integration test
MasterReplicationTest.TestMasterPeerSetsDontMatch.

So here's another attempt, where sys_catalog_ is only set when it is fully
formed (i.e. when it has a functional TabletPeer). It turns out this isn't
enough; we also need to prevent ElectedAsLeaderCb from making progress until
InitSysCatalogAsync() sets sys_catalog_. The extra lock acquisition is hacky
in that it doesn't explicitly protect anything, but it gets the job done.

Below I've included test output when the race hits.

master_replication-itest: /home/jenkins-slave/workspace/kudu-3/src/kudu/gutil/ref_counted.h:273: T *scoped_refptr<kudu::tablet::TabletPeer>::operator->() const [T = kudu::tablet::TabletPeer]: Assertion `ptr_ != __null' failed.
*** Aborted at 1471309445 (unix time) try "date -d @1471309445" if you are using GNU date ***
PC: @ 0x7f330225dcc9 gsignal
*** SIGABRT (@0x3e800006e90) received by PID 28304 (TID 0x7f32f06eb700) from PID 28304; stack trace: ***
    @ 0x42e687 __tsan::CallUserSignalHandler() at /home/jenkins-slave/workspace/kudu-3/thirdparty/llvm-3.8.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:1962
    @ 0x42f4d3 rtl_sigaction() at /home/jenkins-slave/workspace/kudu-3/thirdparty/llvm-3.8.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:2039
    @ 0x7f33090a4340 (unknown) at ??:0
    @ 0x7f330225dcc9 gsignal at ??:0
    @ 0x7f33022610d8 abort at ??:0
    @ 0x7f3302256b86 (unknown) at ??:0
    @ 0x7f3302256c32 __assert_fail at ??:0
    @ 0x7f330ca13130 scoped_refptr<>::operator->() at ??:0
    @ 0x7f330ca1a952 kudu::master::SysCatalogTable::tablet_id() at ??:0
    @ 0x7f330ca0b136 kudu::master::CatalogManager::GetTabletPeer() at ??:0
    @ 0x7f330c69214d kudu::tserver::(anonymous namespace)::LookupTabletPeerOrRespond<>() at ??:0
    @ 0x7f330c691bab kudu::tserver::ConsensusServiceImpl::RequestConsensusVote() at ??:0
    @ 0x7f3307c9fca5 kudu::consensus::ConsensusServiceIf::ConsensusServiceIf()::$_1::operator()() at ??:0
    @ 0x7f3307c9fabf std::_Function_handler<>::_M_invoke() at ??:0
    @ 0x7f3306bd7219 std::function<>::operator()() at ??:0
    @ 0x7f3306bd6c8e kudu::rpc::GeneratedServiceIf::Handle() at ??:0
    @ 0x7f3306bd8b3e kudu::rpc::ServicePool::RunThread() at ??:0
    @ 0x7f3306bdaa27 boost::_mfi::mf0<>::operator()() at ??:0
    @ 0x7f3306bda98b boost::_bi::list1<>::operator()<>() at ??:0
    @ 0x7f3306bda934 boost::_bi::bind_t<>::operator()() at ??:0
    @ 0x7f3306bda75a boost::detail::function::void_function_obj_invoker0<>::invoke() at ??:0
    @ 0x7f3306b758b2 boost::function0<>::operator()() at ??:0
    @ 0x7f3304962630 kudu::Thread::SuperviseThread() at ??:0

Change-Id: I43fdc6499cb84d2053bed08b689fe5a08a6761d6
Reviewed-on: http://gerrit.cloudera.org:8080/3997
Tested-by: Kudu Jenkins
Reviewed-by: Todd Lipcon <t...@apache.org>
---
M src/kudu/master/catalog_manager.cc
1 file changed, 12 insertions(+), 6 deletions(-)

Approvals:
  Todd Lipcon: Looks good to me, approved
  Kudu Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/3997
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I43fdc6499cb84d2053bed08b689fe5a08a6761d6
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <din...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>