[ https://issues.apache.org/jira/browse/KUDU-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16410244#comment-16410244 ]
Todd Lipcon commented on KUDU-2370:
-----------------------------------

As for the impact of this issue, I'm looking through logs on a 120-node cluster which is under some load, and found cases like this:
- 19/20 ConsensusService threads are blocked on RaftConsensus::lock_
-- 1 thread in RaftConsensus::UpdateReplica waiting to acquire 'lock_'
-- 4 in RaftConsensus::Update() waiting to acquire update_lock_
-- 11 in RequestVote trying to acquire lock_
-- 1 in StartElection
-- 3 in DoElectionCallback

The only thread not blocked is in this stack:
{code}
0x337a80f7e0 <unknown>
0x337a80ba5e <unknown>
0x1c25269 kudu::ConditionVariable::WaitUntil()
0xaf280f kudu::consensus::RaftConsensus::UpdateReplica()
0xaf3ff7 kudu::consensus::RaftConsensus::Update()
0x8bd6a9 kudu::tserver::ConsensusServiceImpl::UpdateConsensus()
0x1b5bf3d kudu::rpc::GeneratedServiceIf::Handle()
0x1b5cc4f kudu::rpc::ServicePool::RunThread()
0x1cc0ef1 kudu::Thread::SuperviseThread()
{code}
The AppendThread is itself just waiting on slow IO:
{code}
W0322 01:59:12.722179 78194 kernel_stack_watchdog.cc:191] Thread 190634 stuck at ../../src/kudu/consensus/log.cc:664 for 2922ms:
Kernel stack:
[<ffffffffa009f09d>] do_get_write_access+0x29d/0x520 [jbd2]
[<ffffffffa009f471>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
[<ffffffffa00ece58>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
[<ffffffffa00c6bb3>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
[<ffffffffa00c6c2c>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
[<ffffffffa00c6f20>] ext4_dirty_inode+0x40/0x60 [ext4]
[<ffffffff811becfb>] __mark_inode_dirty+0x3b/0x160 [<ffffffff811af3e2>] file_update_time+0xf2/0x170
[<ffffffff81129cf0>] __generic_file_aio_write+0x230/0x490
[<ffffffff81129fd8>] generic_file_aio_write+0x88/0x100
[<ffffffffa00c0e08>] ext4_file_write+0x58/0x190 [ext4]
[<ffffffff81191dcb>] do_sync_readv_writev+0xfb/0x140
[<ffffffff81192e76>] do_readv_writev+0xd6/0x1f0
[<ffffffff81192fd6>] vfs_writev+0x46/0x60
[<ffffffff81193092>] sys_pwritev+0xa2/0xc0
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
{code}
So, despite having 20 threads, they are all blocked on work for this one tablet, and that in turn causes a bunch of pre-elections even on idle tablets.

> Allow accessing consensus metadata during flush/sync
> ----------------------------------------------------
>
>                 Key: KUDU-2370
>                 URL: https://issues.apache.org/jira/browse/KUDU-2370
>             Project: Kudu
>          Issue Type: Improvement
>          Components: consensus, perf
>    Affects Versions: 1.8.0
>            Reporter: Todd Lipcon
>            Priority: Major
>
> In some cases when disks are overloaded or starting to go bad, flushing
> consensus metadata can take a significant amount of time. Currently, we hold
> the RaftConsensus::lock_ for the duration of things like voting or changing
> term, which blocks other requests such as writes or UpdateConsensus calls.
> There are certainly some cases where exposing "dirty" (non-durable) cmeta is
> illegal from a Raft perspective, but there are other cases where it is safe.
> For example:
> - Assume we receive a Write request, and we see that cmeta is currently busy
> flushing a change that marks the local replica as a FOLLOWER. In that case,
> if we wait on the lock, then when we eventually acquire it we'll just reject the
> request anyway. We might as well reject it immediately.
> - Assume we receive a Write request, and we see that cmeta is currently
> flushing a change that will mark the local replica as a LEADER in the next
> term. CheckLeadershipAndBindTerm can safely bind to the upcoming term rather
> than blocking until the flush completes.
> - Assume we receive an UpdateConsensus or Vote request for term N, and we see
> that we're currently flushing a change to term M > N. I think it's safe to
> reject the request even though the new term isn't yet durable.
> There are probably a few other cases where it's safe to act on
> not-yet-durable info.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
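As a minimal sketch of the decision logic proposed in the description above — this is NOT Kudu code, and all names here (PendingCmeta, Decision, DecideOnRequest) are hypothetical illustrations, not actual Kudu APIs — the "act on not-yet-durable cmeta" idea could look like:

```cpp
// Hypothetical sketch: deciding how to handle an incoming request while a
// consensus-metadata (cmeta) flush is in flight, without waiting on the lock.
#include <cassert>
#include <cstdint>
#include <optional>

enum class Decision {
  kWaitForFlush,       // no shortcut applies; take the normal (blocking) path
  kRejectNow,          // request is stale w.r.t. the term being flushed
  kBindToPendingTerm,  // safe to bind to the not-yet-durable term
};

struct PendingCmeta {
  int64_t durable_term;                  // last term flushed to disk
  std::optional<int64_t> flushing_term;  // term currently being fsync'ed, if any
};

// Mirrors the cases in the issue description: if we're flushing term M and a
// request arrives for term N with M > N, reject immediately; if the request is
// for the term being flushed, bind to it early rather than block.
Decision DecideOnRequest(const PendingCmeta& cmeta, int64_t request_term) {
  if (!cmeta.flushing_term) {
    return Decision::kWaitForFlush;  // no flush in flight: normal path
  }
  if (*cmeta.flushing_term > request_term) {
    // Safe to reject the stale request even though term M isn't durable yet.
    return Decision::kRejectNow;
  }
  if (*cmeta.flushing_term == request_term) {
    // Analogous to CheckLeadershipAndBindTerm binding to the upcoming term.
    return Decision::kBindToPendingTerm;
  }
  return Decision::kWaitForFlush;
}
```

The key property is that DecideOnRequest reads the in-flight state without holding the flush for its whole duration, so RPC handler threads don't pile up behind a single slow fsync the way the 19/20 blocked ConsensusService threads did above.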