[ https://issues.apache.org/jira/browse/KUDU-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16410244#comment-16410244 ]

Todd Lipcon commented on KUDU-2370:
-----------------------------------

As for the impact of this issue, I looked through the logs on a 120-node cluster 
which is under some load and found cases like this:

- 19/20 ConsensusService threads are blocked on RaftConsensus::lock_
-- 1 thread in RaftConsensus::UpdateReplica waiting to acquire 'lock_'
-- 4 in RaftConsensus::Update() waiting to acquire update_lock_
-- 11 in RequestVote trying to acquire lock_
-- 1 in StartElection
-- 3 in DoElectionCallback

The only thread not blocked is in this stack:
{code}
        0x337a80f7e0 <unknown>
        0x337a80ba5e <unknown>
           0x1c25269 kudu::ConditionVariable::WaitUntil()
            0xaf280f kudu::consensus::RaftConsensus::UpdateReplica()
            0xaf3ff7 kudu::consensus::RaftConsensus::Update()
            0x8bd6a9 kudu::tserver::ConsensusServiceImpl::UpdateConsensus()
           0x1b5bf3d kudu::rpc::GeneratedServiceIf::Handle()
           0x1b5cc4f kudu::rpc::ServicePool::RunThread()
           0x1cc0ef1 kudu::Thread::SuperviseThread()
{code}

The AppendThread is itself just waiting on slow IO:
{code}
W0322 01:59:12.722179 78194 kernel_stack_watchdog.cc:191] Thread 190634 stuck at ../../src/kudu/consensus/log.cc:664 for 2922ms:
Kernel stack:
[<ffffffffa009f09d>] do_get_write_access+0x29d/0x520 [jbd2]
[<ffffffffa009f471>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
[<ffffffffa00ece58>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
[<ffffffffa00c6bb3>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
[<ffffffffa00c6c2c>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
[<ffffffffa00c6f20>] ext4_dirty_inode+0x40/0x60 [ext4]
[<ffffffff811becfb>] __mark_inode_dirty+0x3b/0x160
[<ffffffff811af3e2>] file_update_time+0xf2/0x170
[<ffffffff81129cf0>] __generic_file_aio_write+0x230/0x490
[<ffffffff81129fd8>] generic_file_aio_write+0x88/0x100
[<ffffffffa00c0e08>] ext4_file_write+0x58/0x190 [ext4]
[<ffffffff81191dcb>] do_sync_readv_writev+0xfb/0x140
[<ffffffff81192e76>] do_readv_writev+0xd6/0x1f0
[<ffffffff81192fd6>] vfs_writev+0x46/0x60
[<ffffffff81193092>] sys_pwritev+0xa2/0xc0
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
{code}

So, it seems that despite having 20 threads, they are all blocked on work for 
this one tablet. With the service pool exhausted, heartbeats for other tablets 
can't be processed in time, and that causes a bunch of pre-elections even on 
idle tablets.
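
To illustrate the failure mode in isolation, here's a minimal, hypothetical C++ 
sketch (not the actual Kudu code; the lock and pool below are just stand-ins for 
RaftConsensus::lock_ and the ConsensusService pool):
{code}
// Hypothetical sketch, not Kudu code: a fixed-size RPC service pool whose
// handlers all contend on one per-tablet lock. If the lock holder is stuck
// behind a slow cmeta flush/fsync, every pool thread ends up queued on that
// single tablet, and heartbeats for other (idle) tablets go unserviced.
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

std::mutex tablet_lock;  // stand-in for one tablet's RaftConsensus::lock_

void HandleConsensusRpc() {
  std::lock_guard<std::mutex> l(tablet_lock);
  // Under the real lock_ this can include flushing consensus metadata,
  // which on an overloaded or failing disk can take seconds.
  std::this_thread::sleep_for(std::chrono::seconds(3));  // simulated slow fsync
}

int main() {
  std::vector<std::thread> pool;
  for (int i = 0; i < 20; i++) {            // 20 ConsensusService threads
    pool.emplace_back(HandleConsensusRpc);  // all serialized behind one lock
  }
  for (auto& t : pool) t.join();            // ~60s total instead of ~3s
}
{code}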

> Allow accessing consensus metadata during flush/sync
> ----------------------------------------------------
>
>                 Key: KUDU-2370
>                 URL: https://issues.apache.org/jira/browse/KUDU-2370
>             Project: Kudu
>          Issue Type: Improvement
>          Components: consensus, perf
>    Affects Versions: 1.8.0
>            Reporter: Todd Lipcon
>            Priority: Major
>
> In some cases when disks are overloaded or starting to go bad, flushing 
> consensus metadata can take a significant amount of time. Currently, we hold 
> the RaftConsensus::lock_ for the duration of things like voting or changing 
> term, which blocks other requests such as writes or UpdateConsensus calls. 
> There are certainly some cases where exposing "dirty" (non-durable) cmeta is 
> illegal from a Raft perspective, but there are other cases where it is safe. 
> For example:
> - Assume we receive a Write request and see that cmeta is currently busy 
> flushing a change that marks the local replica as a FOLLOWER. In that case, 
> if we wait on the lock, we'll just reject the request anyway once we 
> eventually acquire it. We might as well reject it immediately.
> - Assume we receive a Write request, and we see that cmeta is currently 
> flushing a change that will mark the local replica as a LEADER in the next 
> term. CheckLeadershipAndBindTerm can safely bind to the upcoming term rather 
> than blocking until the flush completes.
> - Assume we receive an UpdateConsensus or Vote request for term N, and we see 
> that we're currently flushing a change to term M > N. I think it's safe to 
> reject the request even though the new term isn't yet durable.
> There are probably a few other cases here where it's safe to act on 
> not-yet-durable info.
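
A minimal C++ sketch of the third case above (the names are hypothetical, not 
the actual Kudu cmeta API): publish the not-yet-durable "dirty" term before the 
slow flush so a vote or update for an older term can be rejected without 
waiting on the flush lock.
{code}
// Hypothetical sketch (CmetaSketch, dirty_term_, etc. are placeholders, not
// the Kudu API): make the not-yet-durable term visible before the slow fsync
// so requests for an older term can be rejected immediately.
#include <atomic>
#include <cstdint>
#include <mutex>

class CmetaSketch {
 public:
  // Term change: publish the dirty term first, then do the slow flush.
  void SetTermAndFlush(int64_t new_term) {
    dirty_term_.store(new_term, std::memory_order_release);
    std::lock_guard<std::mutex> l(flush_lock_);
    // ... serialize and fsync consensus metadata here (can be slow) ...
    durable_term_ = new_term;
  }

  // RequestVote/UpdateConsensus for candidate_term: if we're already moving
  // to a higher term (even though it isn't durable yet), rejecting is safe.
  bool ShouldRejectForStaleTerm(int64_t candidate_term) const {
    return candidate_term < dirty_term_.load(std::memory_order_acquire);
  }

 private:
  std::atomic<int64_t> dirty_term_{0};
  std::mutex flush_lock_;
  int64_t durable_term_ = 0;
};
{code}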



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
