[ https://issues.apache.org/jira/browse/KUDU-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117387#comment-17117387 ]
Alexey Serbin commented on KUDU-1865:
-------------------------------------

Some more stacks captured from diagnostic logs for the {{kudu-master}} process (Kudu 1.10):

{noformat}
Stacks at 0516 18:53:00.042003 (service queue overflowed for kudu.master.MasterService):

tids=[736230]
    0x7f803a76a5e0 <unknown>
    0xb6219e  tcmalloc::ThreadCache::ReleaseToCentralCache()
    0xb62530  tcmalloc::ThreadCache::Scavenge()
    0xad8a27  kudu::master::CatalogManager::ScopedLeaderSharedLock::ScopedLeaderSharedLock()
    0xaa3a31  kudu::master::MasterServiceImpl::GetTableSchema()
    0x221aaa9 kudu::rpc::GeneratedServiceIf::Handle()
    0x221b1a9 kudu::rpc::ServicePool::RunThread()
    0x23a8f84 kudu::Thread::SuperviseThread()
    0x7f803a762e25 start_thread
    0x7f8038a4134d __clone

tids=[736248,736245,736243,736242]
    0x7f803a76a5e0 <unknown>
    0x23c5b44 base::internal::SpinLockDelay()
    0x23c59cc base::SpinLock::SlowLock()
    0xac5814  kudu::master::CatalogManager::CheckOnline()
    0xae5032  kudu::master::CatalogManager::GetTableSchema()
    0xaa3a85  kudu::master::MasterServiceImpl::GetTableSchema()
    0x221aaa9 kudu::rpc::GeneratedServiceIf::Handle()
    0x221b1a9 kudu::rpc::ServicePool::RunThread()
    0x23a8f84 kudu::Thread::SuperviseThread()
    0x7f803a762e25 start_thread
    0x7f8038a4134d __clone

tids=[736239,736229,736232,736233,736234,736235,736236,736237,736238,736240,736241,736244,736247]
    0x7f803a76a5e0 <unknown>
    0x23c5b44 base::internal::SpinLockDelay()
    0x23c59cc base::SpinLock::SlowLock()
    0xac5814  kudu::master::CatalogManager::CheckOnline()
    0xaf102f  kudu::master::CatalogManager::GetTableLocations()
    0xaa36f8  kudu::master::MasterServiceImpl::GetTableLocations()
    0x221aaa9 kudu::rpc::GeneratedServiceIf::Handle()
    0x221b1a9 kudu::rpc::ServicePool::RunThread()
    0x23a8f84 kudu::Thread::SuperviseThread()
    0x7f803a762e25 start_thread
    0x7f8038a4134d __clone

tids=[736246,736231]
    0x7f803a76a5e0 <unknown>
    0x23c5b44 base::internal::SpinLockDelay()
    0x23c59cc base::SpinLock::SlowLock()
    0xad8b7c  kudu::master::CatalogManager::ScopedLeaderSharedLock::ScopedLeaderSharedLock()
    0xaa369d  kudu::master::MasterServiceImpl::GetTableLocations()
    0x221aaa9 kudu::rpc::GeneratedServiceIf::Handle()
    0x221b1a9 kudu::rpc::ServicePool::RunThread()
    0x23a8f84 kudu::Thread::SuperviseThread()
    0x7f803a762e25 start_thread
    0x7f8038a4134d __clone
{noformat}

> Create fast path for RespondSuccess() in KRPC
> ---------------------------------------------
>
>                  Key: KUDU-1865
>                  URL: https://issues.apache.org/jira/browse/KUDU-1865
>              Project: Kudu
>           Issue Type: Improvement
>           Components: rpc
>             Reporter: Sailesh Mukil
>             Priority: Major
>               Labels: perfomance, rpc
>          Attachments: alloc-pattern.py, cross-thread.txt
>
> A lot of RPCs just respond with RespondSuccess(), which returns the exact same payload every time. This takes the same path as any other response by ultimately calling Connection::QueueResponseForCall(), which makes a few small allocations. These small allocations (and their corresponding deallocations) happen frequently (once for every IncomingCall) and end up taking quite some time in the kernel (traversing the free list, spin locks, etc.).
>
> This was found when [~mmokhtar] ran some profiles of Impala over KRPC on a 20-node cluster. The exact % of time spent is hard to quantify from the profiles, but these were among the top 5 slowest stacks:
>
> {code:java}
> impalad ! tcmalloc::CentralFreeList::ReleaseToSpans - [unknown source file]
> impalad ! tcmalloc::CentralFreeList::ReleaseListToSpans + 0x1a - [unknown source file]
> impalad ! tcmalloc::CentralFreeList::InsertRange + 0x3b - [unknown source file]
> impalad ! tcmalloc::ThreadCache::ReleaseToCentralCache + 0x103 - [unknown source file]
> impalad ! tcmalloc::ThreadCache::Scavenge + 0x3e - [unknown source file]
> impalad ! operator delete + 0x329 - [unknown source file]
> impalad ! __gnu_cxx::new_allocator<kudu::Slice>::deallocate + 0x4 - new_allocator.h:110
> impalad ! std::_Vector_base<kudu::Slice, std::allocator<kudu::Slice>>::_M_deallocate + 0x5 - stl_vector.h:178
> impalad ! ~_Vector_base + 0x4 - stl_vector.h:160
> impalad ! ~vector - stl_vector.h:425                 <---- Deleting 'slices' vector
> impalad ! kudu::rpc::Connection::QueueResponseForCall + 0xac - connection.cc:433
> impalad ! kudu::rpc::InboundCall::Respond + 0xfa - inbound_call.cc:133
> impalad ! kudu::rpc::InboundCall::RespondSuccess + 0x43 - inbound_call.cc:77
> impalad ! kudu::rpc::RpcContext::RespondSuccess + 0x1f7 - rpc_context.cc:66
> ..
> {code}
>
> {code:java}
> impalad ! tcmalloc::CentralFreeList::FetchFromOneSpans - [unknown source file]
> impalad ! tcmalloc::CentralFreeList::RemoveRange + 0xc0 - [unknown source file]
> impalad ! tcmalloc::ThreadCache::FetchFromCentralCache + 0x62 - [unknown source file]
> impalad ! operator new + 0x297 - [unknown source file]    <--- Creating new 'OutboundTransferTask' object.
> impalad ! kudu::rpc::Connection::QueueResponseForCall + 0x76 - connection.cc:432
> impalad ! kudu::rpc::InboundCall::Respond + 0xfa - inbound_call.cc:133
> impalad ! kudu::rpc::InboundCall::RespondSuccess + 0x43 - inbound_call.cc:77
> impalad ! kudu::rpc::RpcContext::RespondSuccess + 0x1f7 - rpc_context.cc:66
> ...
> {code}
>
> Even creating and deleting the 'RpcContext' takes a lot of time:
>
> {code:java}
> impalad ! tcmalloc::CentralFreeList::ReleaseToSpans - [unknown source file]
> impalad ! tcmalloc::CentralFreeList::ReleaseListToSpans + 0x1a - [unknown source file]
> impalad ! tcmalloc::CentralFreeList::InsertRange + 0x3b - [unknown source file]
> impalad ! tcmalloc::ThreadCache::ReleaseToCentralCache + 0x103 - [unknown source file]
> impalad ! tcmalloc::ThreadCache::Scavenge + 0x3e - [unknown source file]
> impalad ! operator delete + 0x329 - [unknown source file]
> impalad ! impala::TransmitDataResponsePb::~TransmitDataResponsePb + 0x16 - impala_internal_service.pb.cc:1221
> impalad ! impala::TransmitDataResponsePb::~TransmitDataResponsePb + 0x8 - impala_internal_service.pb.cc:1222
> impalad ! kudu::DefaultDeleter<google::protobuf::Message>::operator() + 0x5 - gscoped_ptr.h:145
> impalad ! ~gscoped_ptr_impl + 0x9 - gscoped_ptr.h:228
> impalad ! ~gscoped_ptr - gscoped_ptr.h:318
> impalad ! kudu::rpc::RpcContext::~RpcContext + 0x1e - rpc_context.cc:53    <-----
> impalad ! kudu::rpc::RpcContext::RespondSuccess + 0x1ff - rpc_context.cc:67
> {code}
>
> The above shows that creating these small objects under moderately heavy load results in heavy contention in the kernel. We would benefit a lot from a fast path for 'RespondSuccess'.
>
> My suggestion is to create all of these small objects at once, in a 'RespondSuccess' structure, while the 'InboundCall' object is being created, and simply use that structure when we want to send 'success' back to the sender. It would already contain the 'OutboundTransferTask', a 'Slice' with 'success', etc. We can expect that most RPCs respond with 'success' a majority of the time.
>
> The benefit is that we no longer go back and forth to allocate and deallocate memory for these small objects on every response; instead we do it all at once while creating the 'InboundCall' object.
>
> I just wanted to start a discussion about this, so even if what I suggested seems a little off, hopefully we can move forward with this on some level.

-- 
This message was sent by Atlassian Jira (v8.3.4#803005)
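The pre-allocation idea proposed in the description can be sketched as follows. This is a minimal, hypothetical illustration, not Kudu's actual implementation: the `PreallocatedSuccess` struct and the simplified `InboundCall` class below are invented for the sketch, which only demonstrates the allocation pattern (reserve storage once when the call object is created, reuse it on every success response instead of allocating and freeing per response).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the proposed 'RespondSuccess' structure:
// all the small per-response objects are allocated once, up front.
struct PreallocatedSuccess {
  std::vector<std::string> slices;  // stands in for the 'slices' vector
  std::string transfer_buf;         // stands in for OutboundTransferTask storage

  PreallocatedSuccess() {
    slices.reserve(4);        // reserve once; avoids per-response vector growth
    transfer_buf.reserve(64); // reserve once; avoids per-response allocation
  }

  void Reset() {
    slices.clear();           // clear() keeps capacity, so no deallocation
    transfer_buf.clear();
  }
};

// Simplified call object: the success-path storage lives inside it,
// so the fast path performs no heap allocation of its own.
class InboundCall {
 public:
  const std::string& RespondSuccess() {
    success_.Reset();                 // reuse pre-reserved storage
    success_.slices.push_back("OK");  // fits in reserved capacity / SSO
    success_.transfer_buf = "success";
    ++responses_sent_;
    return success_.transfer_buf;
  }

  int responses_sent() const { return responses_sent_; }

 private:
  PreallocatedSuccess success_;  // allocated once, together with the call
  int responses_sent_ = 0;
};
```

In the real codebase the pre-allocated structure would hold the 'OutboundTransferTask' and the response 'Slice's mentioned above; the sketch only shows the reserve-once/reuse-per-response pattern that removes the allocator traffic seen in the profiled stacks.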