[jira] [Created] (KUDU-3154) RangerClientTestBase.TestLogging sometimes fails

2020-06-20 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3154:
---

 Summary: RangerClientTestBase.TestLogging sometimes fails
 Key: KUDU-3154
 URL: https://issues.apache.org/jira/browse/KUDU-3154
 Project: Kudu
  Issue Type: Bug
  Components: ranger, test
Affects Versions: 1.13.0
Reporter: Alexey Serbin
 Attachments: ranger_client-test.txt.xz

The {{RangerClientTestBase.TestLogging}} scenario of the {{ranger_client-test}} 
sometimes fails (in all types of builds) with an error message like the one below:

{noformat}
src/kudu/ranger/ranger_client-test.cc:398: Failure
Failed  
Bad status: Timed out: timed out while in flight
I0620 07:06:02.907177  1140 server.cc:247] Received an EOF from the subprocess  
I0620 07:06:02.910923  1137 server.cc:317] get failed, inbound queue shut down: 
Aborted:
I0620 07:06:02.910964  1141 server.cc:380] outbound queue shut down: Aborted:   
I0620 07:06:02.910995  1138 server.cc:317] get failed, inbound queue shut down: 
Aborted:
I0620 07:06:02.910984  1139 server.cc:317] get failed, inbound queue shut down: 
Aborted:
{noformat}

The log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows

2020-06-18 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2727:

Component/s: consensus, tserver  (was: perf)

> Contention on the Raft consensus lock can cause tablet service queue overflows
> --
>
> Key: KUDU-2727
> URL: https://issues.apache.org/jira/browse/KUDU-2727
> Project: Kudu
>  Issue Type: Improvement
>  Components: consensus, tserver
>Reporter: William Berkeley
>Assignee: Alexey Serbin
>Priority: Major
>  Labels: performance, scalability
>
> Here are stacks illustrating the phenomenon:
> {noformat}
>   tids=[2201]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb4e68e kudu::consensus::Peer::SignalRequest()
> 0xb9c0df kudu::consensus::PeerManager::SignalRequest()
> 0xb8c178 kudu::consensus::RaftConsensus::Replicate()
> 0xaab816 kudu::tablet::TransactionDriver::Prepare()
> 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask()
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[4515]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb74c60 kudu::consensus::RaftConsensus::NotifyCommitIndex()
> 0xb59307 kudu::consensus::PeerMessageQueue::NotifyObserversTask()
> 0xb54058 
> _ZN4kudu8internal7InvokerILi2ENS0_9BindStateINS0_15RunnableAdapterIMNS_9consensus16PeerMessageQueueEFvRKSt8functionIFvPNS4_24PeerMessageQueueObserverEEFvPS5_SC_EFvNS0_17UnretainedWrapperIS5_EEZNS5_34NotifyObserversOfCommitIndexChangeElEUlS8_E_EEESH_E3RunEPNS0_13BindStateBaseE
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[22185,22194,22193,22188,22187,22186]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
> 0xb8bff8 
> kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm()
> 0xaaaef9 kudu::tablet::TransactionDriver::ExecuteAsync()
> 0xaa3742 kudu::tablet::TabletReplica::SubmitWrite()
> 0x92812d kudu::tserver::TabletServiceImpl::Write()
>0x1e28f3c kudu::rpc::GeneratedServiceIf::Handle()
>0x1e2986a kudu::rpc::ServicePool::RunThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[22192,22191]
> 0x379ba0f710 
>0x1fb951a base::internal::SpinLockDelay()
>0x1fb93b7 base::SpinLock::SlowLock()
>0x1e13dec kudu::rpc::ResultTracker::TrackRpc()
>0x1e28ef5 kudu::rpc::GeneratedServiceIf::Handle()
>0x1e2986a kudu::rpc::ServicePool::RunThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
>   tids=[4426]
> 0x379ba0f710 
>0x206d3d0 
>0x212fd25 google::protobuf::Message::SpaceUsedLong()
>0x211dee4 
> google::protobuf::internal::GeneratedMessageReflection::SpaceUsedLong()
> 0xb6658e kudu::consensus::LogCache::AppendOperations()
> 0xb5c539 kudu::consensus::PeerMessageQueue::AppendOperations()
> 0xb5c7c7 kudu::consensus::PeerMessageQueue::AppendOperation()
> 0xb7c675 
> kudu::consensus::RaftConsensus::AppendNewRoundToQueueUnlocked()
> 0xb8c147 kudu::consensus::RaftConsensus::Replicate()
> 0xaab816 kudu::tablet::TransactionDriver::Prepare()
> 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask()
>0x1fa37ed kudu::ThreadPool::DispatchThread()
>0x1f9c2a1 kudu::Thread::SuperviseThread()
> 0x379ba079d1 start_thread
> 0x379b6e88fd clone
> {noformat}
> {{kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm()}} needs to 
> take the lock to check the term and the Raft role. When many RPCs come in for 
> the same tablet, the contention can hog service threads and cause queue 
> overflows on busy systems.
> Yugabyte switched their equivalent lock to an atomic, which allows them to 
> read the term and role wait-free.
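
A minimal illustration of the wait-free approach mentioned above, with 
hypothetical names; this is a sketch, not Kudu's or Yugabyte's actual 
implementation. It assumes the term (non-negative, below 2^56) and the role 
can be packed into a single 64-bit word, so readers never touch the consensus 
lock:

{code}
#include <atomic>
#include <cstdint>

enum class RaftRole : uint8_t { kLeader, kFollower, kLearner };  // hypothetical

class AtomicTermAndRole {
 public:
  // Writers (which already hold the consensus lock) publish the term and
  // the role together with a single release-store.
  void Set(int64_t term, RaftRole role) {
    state_.store((static_cast<uint64_t>(term) << 8) |
                     static_cast<uint64_t>(role),
                 std::memory_order_release);
  }

  // Readers (e.g. a CheckLeadershipAndBindTerm()-style fast path) take a
  // consistent term/role snapshot without acquiring any lock.
  bool IsLeaderAndGetTerm(int64_t* term) const {
    const uint64_t s = state_.load(std::memory_order_acquire);
    *term = static_cast<int64_t>(s >> 8);
    return static_cast<RaftRole>(s & 0xFF) == RaftRole::kLeader;
  }

 private:
  std::atomic<uint64_t> state_{0};
};
{code}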



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows

2020-06-18 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2727:

Status: In Review  (was: In Progress)

> Contention on the Raft consensus lock can cause tablet service queue overflows
> --
>
> Key: KUDU-2727
> URL: https://issues.apache.org/jira/browse/KUDU-2727
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf
>Reporter: William Berkeley
>Assignee: Alexey Serbin
>Priority: Major
>  Labels: performance, scalability
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows

2020-06-18 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reassigned KUDU-2727:
---

Assignee: Alexey Serbin

> Contention on the Raft consensus lock can cause tablet service queue overflows
> --
>
> Key: KUDU-2727
> URL: https://issues.apache.org/jira/browse/KUDU-2727
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf
>Reporter: William Berkeley
>Assignee: Alexey Serbin
>Priority: Major
>  Labels: performance, scalability
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows

2020-06-18 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2727:

Labels: performance scalability  (was: )

> Contention on the Raft consensus lock can cause tablet service queue overflows
> --
>
> Key: KUDU-2727
> URL: https://issues.apache.org/jira/browse/KUDU-2727
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf
>Reporter: William Berkeley
>Priority: Major
>  Labels: performance, scalability
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows

2020-06-18 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2727:

Code Review: https://gerrit.cloudera.org/#/c/16034/

> Contention on the Raft consensus lock can cause tablet service queue overflows
> --
>
> Key: KUDU-2727
> URL: https://issues.apache.org/jira/browse/KUDU-2727
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf
>Reporter: William Berkeley
>Assignee: Alexey Serbin
>Priority: Major
>  Labels: performance, scalability
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3129) ToolTest.TestHmsList can timeout

2020-06-16 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136729#comment-17136729
 ] 

Alexey Serbin commented on KUDU-3129:
-

The test also times out in RELEASE builds; the log is attached: 
[^kudu-tool-test.2.txt.xz] 

> ToolTest.TestHmsList can timeout
> 
>
> Key: KUDU-3129
> URL: https://issues.apache.org/jira/browse/KUDU-3129
> Project: Kudu
>  Issue Type: Bug
>  Components: hms, test
>Affects Versions: 1.12.0
>Reporter: Andrew Wong
>Priority: Major
> Attachments: kudu-tool-test.2.txt, kudu-tool-test.2.txt.xz
>
>
> When running in TSAN mode, the test timed out, spending 10 minutes not really 
> doing anything. It isn't obvious why, but ToolTest.TestHmsList can time out, 
> appearing to hang while running the HMS tool.
> {code}
> I0521 22:31:49.436857  4601 catalog_manager.cc:1161] Initializing in-progress 
> tserver states...
> I0521 22:31:49.446161  4606 hms_notification_log_listener.cc:228] Skipping 
> Hive Metastore notification log poll: Service unavailable: Catalog manager is 
> not initialized. State: Starting
> I0521 22:31:49.839709  4488 heartbeater.cc:325] Connected to a master server 
> at 127.0.89.254:42487
> I0521 22:31:49.845547  4559 master_service.cc:295] Got heartbeat from unknown 
> tserver (permanent_uuid: "cf9e08c4271e4d9aa28b1aacbd630908" instance_seqno: 
> 1590100304311876) as {username='slave'} at 127.0.89.193:33867; Asking this 
> server to re-register.
> I0521 22:31:49.846786  4488 heartbeater.cc:416] Registering TS with master...
> I0521 22:31:49.847297  4488 heartbeater.cc:465] Master 127.0.89.254:42487 
> requested a full tablet report, sending...
> I0521 22:31:49.849771  4559 ts_manager.cc:191] Registered new tserver with 
> Master: cf9e08c4271e4d9aa28b1aacbd630908 (127.0.89.193:43527)
> I0521 22:31:49.852535   359 external_mini_cluster.cc:699] 1 TS(s) registered 
> with all masters
> W0521 22:32:23.142868  4545 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002b060 after lost signal to thread 4531
> W0521 22:32:23.14  4545 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002b780 after lost signal to thread 4591
> W0521 22:32:28.996440  4545 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002b740 after lost signal to thread 4531
> W0521 22:32:28.996966  4545 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002b520 after lost signal to thread 4591
> W0521 22:33:05.743249  4380 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002aae0 after lost signal to thread 4360
> W0521 22:33:05.743983  4380 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002af00 after lost signal to thread 4486
> I0521 22:33:49.594769  4549 maintenance_manager.cc:326] P 
> c3cc85c33a5447b2aa520019fe162966: Scheduling 
> FlushMRSOp(): perf score=0.033386
> I0521 22:33:49.637208  4548 maintenance_manager.cc:525] P 
> c3cc85c33a5447b2aa520019fe162966: 
> FlushMRSOp() complete. Timing: real 0.042s
> user 0.032s sys 0.008s Metrics: 
> {"bytes_written":6485,"cfile_init":1,"dirs.queue_time_us":675,"dirs.run_cpu_time_us":237,"dirs.run_wall_time_us":997,"drs_written":1,"lbm_read_time_us":231,"lbm_reads_lt_1ms":4,"lbm_write_time_us":1980,"lbm_writes_lt_1ms":27,"rows_written":5,"thread_start_us":953,"threads_started":2,"wal-append.queue_time_us":819}
> I0521 22:33:49.639096  4549 maintenance_manager.cc:326] P 
> c3cc85c33a5447b2aa520019fe162966: Scheduling 
> UndoDeltaBlockGCOp(): 396 bytes on disk
> I0521 22:33:49.640486  4548 maintenance_manager.cc:525] P 
> c3cc85c33a5447b2aa520019fe162966: 
> UndoDeltaBlockGCOp() complete. Timing: real 
> 0.001suser 0.001s sys 0.000s Metrics: 
> {"cfile_init":1,"lbm_read_time_us":269,"lbm_reads_lt_1ms":4}
> W0521 22:34:17.794472  4380 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002ade0 after lost signal to thread 4360
> W0521 22:34:17.795437  4380 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002a7e0 after lost signal to thread 4486
> W0521 22:34:20.286921  4545 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002b2e0 after lost signal to thread 4531
> W0521 22:34:20.287376  4545 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002b140 after lost signal to thread 4591
> W0521 22:35:27.726336  4380 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002af40 after lost signal to thread 4360
> W0521 22:35:27.727084  4380 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002a980 after lost signal to thread 4486
> W0521 22:36:12.250830  4545 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002b9c0 after lost signal to thread 4531
> W0521 22:36:12.251247  4545 debug-util.cc:402] Leaking SignalData structure 
> 0x7b080002b220 after lost signal to thread 4591

[jira] [Updated] (KUDU-3129) ToolTest.TestHmsList can timeout

2020-06-16 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3129:

Attachment: kudu-tool-test.2.txt.xz

> ToolTest.TestHmsList can timeout
> 
>
> Key: KUDU-3129
> URL: https://issues.apache.org/jira/browse/KUDU-3129
> Project: Kudu
>  Issue Type: Bug
>  Components: hms, test
>Affects Versions: 1.12.0
>Reporter: Andrew Wong
>Priority: Major
> Attachments: kudu-tool-test.2.txt, kudu-tool-test.2.txt.xz
>
>

[jira] [Resolved] (KUDU-3145) KUDU_LINK should be set before function APPEND_LINKER_FLAGS is called

2020-06-12 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-3145.
-
Fix Version/s: 1.13.0
   Resolution: Fixed

> KUDU_LINK should be set before function APPEND_LINKER_FLAGS is called
> -
>
> Key: KUDU-3145
> URL: https://issues.apache.org/jira/browse/KUDU-3145
> Project: Kudu
>  Issue Type: Sub-task
>Reporter: zhaorenhai
>Assignee: huangtianhua
>Priority: Major
> Fix For: 1.13.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> KUDU_LINK should be set before function APPEND_LINKER_FLAGS is called
>  
> Because the function APPEND_LINKER_FLAGS contains the following logic:
> {code:java}
> if ("${LINKER_FAMILY}" STREQUAL "gold")
>   if("${LINKER_VERSION}" VERSION_LESS "1.12" AND
>  "${KUDU_LINK}" STREQUAL "d")
> message(WARNING "Skipping gold <1.12 with dynamic linking.")
> continue()
>   endif()
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows

2020-06-04 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126243#comment-17126243
 ] 

Alexey Serbin commented on KUDU-2727:
-

One more set of stack traces:

{noformat}
  tids=[1324418]
  0x7f61b79fc5e0 
   0x1ec35f4 base::internal::SpinLockDelay()
   0x1ec347c base::SpinLock::SlowLock()
0xb4236d kudu::consensus::Peer::SendNextRequest()
0xb43771 
_ZN5boost6detail8function26void_function_obj_invoker0IZN4kudu9consensus4Peer13SignalRequestEbEUlvE_vE6invokeERNS1_15function_bufferE
   0x1eb1d1d kudu::FunctionRunnable::Run()
   0x1eaedbf kudu::ThreadPool::DispatchThread()
   0x1ea4a84 kudu::Thread::SuperviseThread()
  0x7f61b79f4e25 start_thread
  0x7f61b5cd234d __clone
  
tids=[93293,93284,93285,93286,93287,93288,93289,93290,93291,93292,93304,93294,93295,93296,93297,93298,93299,93300,93301,93302,93303,93313,93322,93321,93320,93319,93318,93317,93316,93315,93314,93283,93312,93311,93310,93309,93308,93307,93306,93305]
  0x7f61b79fc5e0 
   0x1ec35f4 base::internal::SpinLockDelay()
   0x1ec347c base::SpinLock::SlowLock()
0xb7deb8 
kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm()
0xaab010 kudu::tablet::TransactionDriver::ExecuteAsync()
0xaa344c kudu::tablet::TabletReplica::SubmitWrite()
0x928fb0 kudu::tserver::TabletServiceImpl::Write()
   0x1d2e8d9 kudu::rpc::GeneratedServiceIf::Handle()
   0x1d2efd9 kudu::rpc::ServicePool::RunThread()
   0x1ea4a84 kudu::Thread::SuperviseThread()
  0x7f61b79f4e25 start_thread
  0x7f61b5cd234d __clone
  tids=[1324661]
  0x7f61b79fc5e0 
   0x1ec35f4 base::internal::SpinLockDelay()
   0x1ec347c base::SpinLock::SlowLock()
0xb7df8e kudu::consensus::RaftConsensus::Replicate()
0xaab8e7 kudu::tablet::TransactionDriver::Prepare()
0xaac009 kudu::tablet::TransactionDriver::PrepareTask()
   0x1eaedbf kudu::ThreadPool::DispatchThread()
   0x1ea4a84 kudu::Thread::SuperviseThread()
  0x7f61b79f4e25 start_thread
  0x7f61b5cd234d __clone
  tids=[93383]
  0x7f61b79fc5e0 
  0x7f61b79f8cf2 __pthread_cond_timedwait
   0x1dfcfa9 kudu::ConditionVariable::WaitUntil()
0xb73bc7 kudu::consensus::RaftConsensus::UpdateReplica()
0xb75128 kudu::consensus::RaftConsensus::Update()
0x92c5d1 kudu::tserver::ConsensusServiceImpl::UpdateConsensus()
   0x1d2e8d9 kudu::rpc::GeneratedServiceIf::Handle()
   0x1d2efd9 kudu::rpc::ServicePool::RunThread()
   0x1ea4a84 kudu::Thread::SuperviseThread()
  0x7f61b79f4e25 start_thread
  0x7f61b5cd234d __clone
{noformat}

Thread {{93383}} holds the lock while waiting on another condition variable, 
blocking many other threads.

> Contention on the Raft consensus lock can cause tablet service queue overflows
> --
>
> Key: KUDU-2727
> URL: https://issues.apache.org/jira/browse/KUDU-2727
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf
>Reporter: William Berkeley
>Priority: Major
>

[jira] [Comment Edited] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows

2020-06-03 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125466#comment-17125466
 ] 

Alexey Serbin edited comment on KUDU-2727 at 6/4/20, 3:14 AM:
--

Another set of stacks, just for more context (captured with code close to Kudu 
1.10.1):

{noformat}
  tids=[1866940]
  0x7fc8d67f95e0 
   0x1ec35f4 base::internal::SpinLockDelay()
   0x1ec347c base::SpinLock::SlowLock()
0xb68518 kudu::consensus::RaftConsensus::NotifyCommitIndex()
0xb4c9e7 kudu::consensus::PeerMessageQueue::NotifyObserversTask()
0xb47850 
_ZN4kudu8internal7InvokerILi2ENS0_9BindStateINS0_15RunnableAdapterIMNS_9consensus16PeerMessageQueueEFvRKSt8functionIFvPNS4_24PeerMessageQueueObserverEEFvPS5_SC_EFvNS0_17UnretainedWrapperIS5_EEZNS5_34NotifyObserversOfCommitIndexChangeElEUlS8_E_EEESH_E3RunEPNS0_13BindStateBaseE
   0x1eaedbf kudu::ThreadPool::DispatchThread()
   0x1ea4a84 kudu::Thread::SuperviseThread()
  0x7fc8d67f1e25 start_thread
  0x7fc8d4acf34d __clone
  
tids=[1370336,1370326,1370327,1370328,1370329,1370330,1370331,1370332,1370333,1370334,1370335,1370323,1370337,1370338,1370339,1370340,1370341,1370342,1370343,1370344,1370345,1370346,1370353,1370361,1370360,1370359,1370358,1370357,1370356,1370355,1370354,1370325,1370352,1370351,1370350,1370349,1370348,1370347,1370322,1370324]
  0x7fc8d67f95e0 
   0x1ec35f4 base::internal::SpinLockDelay()
   0x1ec347c base::SpinLock::SlowLock()
0xb7deb8 
kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm()
0xaab010 kudu::tablet::TransactionDriver::ExecuteAsync()
0xaa344c kudu::tablet::TabletReplica::SubmitWrite()
0x928fb0 kudu::tserver::TabletServiceImpl::Write()
   0x1d2e8d9 kudu::rpc::GeneratedServiceIf::Handle()
   0x1d2efd9 kudu::rpc::ServicePool::RunThread()
   0x1ea4a84 kudu::Thread::SuperviseThread()
  0x7fc8d67f1e25 start_thread
  0x7fc8d4acf34d __clone
  tids=[1866932,1866929]
  0x7fc8d67f95e0 
  0x7fc8d67f5943 __pthread_cond_wait
0xb99b38 kudu::log::Log::AsyncAppend()
0xb9c24c kudu::log::Log::AsyncAppendCommit()
0xaad489 kudu::tablet::TransactionDriver::ApplyTask()
   0x1eaedbf kudu::ThreadPool::DispatchThread()
   0x1ea4a84 kudu::Thread::SuperviseThread()
  0x7fc8d67f1e25 start_thread
  0x7fc8d4acf34d __clone
  tids=[1866928]
  0x7fc8d67f95e0 
  0x7fc8d67f5943 __pthread_cond_wait
0xb99b38 kudu::log::Log::AsyncAppend()
0xb9c493 kudu::log::Log::AsyncAppendReplicates()
0xb597e9 kudu::consensus::LogCache::AppendOperations()
0xb4fa24 kudu::consensus::PeerMessageQueue::AppendOperations()
0xb4fd45 kudu::consensus::PeerMessageQueue::AppendOperation()
0xb6f28c 
kudu::consensus::RaftConsensus::AppendNewRoundToQueueUnlocked()
0xb7dff8 kudu::consensus::RaftConsensus::Replicate()
0xaab8e7 kudu::tablet::TransactionDriver::Prepare()
0xaac009 kudu::tablet::TransactionDriver::PrepareTask()
   0x1eaedbf kudu::ThreadPool::DispatchThread()
   0x1ea4a84 kudu::Thread::SuperviseThread()
{noformat}

In the stacks above, thread {{1866928}} is holding a lock taken in 
{{RaftConsensus::Replicate()}} while waiting on a condition variable in 
{{Log::AsyncAppend()}}, calling 
{{entry_batch_queue_.BlockingPut(entry_batch.get())}}.
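
A minimal sketch of the hazard described above, with hypothetical names: a 
mutex standing in for the consensus lock is held across a blocking put into a 
bounded queue, so every thread that needs the lock stalls until the queue 
drains.

{code}
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

class BoundedQueue {
 public:
  explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

  // Blocks the caller until the queue has room for one more item.
  void BlockingPut(int item) {
    std::unique_lock<std::mutex> l(mu_);
    not_full_.wait(l, [this] { return q_.size() < capacity_; });
    q_.push(item);
  }

 private:
  const std::size_t capacity_;
  std::mutex mu_;
  std::condition_variable not_full_;
  std::queue<int> q_;
};

std::mutex consensus_lock;           // stands in for the RaftConsensus lock
BoundedQueue entry_batch_queue(16);  // stands in for entry_batch_queue_

void Replicate(int batch) {
  std::lock_guard<std::mutex> l(consensus_lock);
  // If the queue is full, this thread sleeps here with consensus_lock held,
  // and every other thread that needs the lock piles up behind it.
  entry_batch_queue.BlockingPut(batch);
}
{code}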



[jira] [Assigned] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows

2020-06-03 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reassigned KUDU-2727:
---

Assignee: (was: Mike Percy)

> Contention on the Raft consensus lock can cause tablet service queue overflows
> --
>
> Key: KUDU-2727
> URL: https://issues.apache.org/jira/browse/KUDU-2727
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf
>Reporter: William Berkeley
>Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-2781) Hardening for location awareness command-line flag

2020-06-03 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2781:

Component/s: master

> Hardening for location awareness command-line flag
> --
>
> Key: KUDU-2781
> URL: https://issues.apache.org/jira/browse/KUDU-2781
> Project: Kudu
>  Issue Type: Improvement
>  Components: master
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-2781) Hardening for location awareness command-line flag

2020-06-03 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2781:

Description: 
Add a few verification steps related to location assignment:
* the location assignment executable is present and executable
* the location assignment executable conforms to the expected interface: it 
accepts one argument (an IP address or DNS name) and outputs the assigned 
location to stdout
* the same DNS name/IP address is assigned the same location
* the resulting location output to stdout conforms to the format for 
locations in Kudu

It's possible to implement these checks in {{kudu-master}} using group flag 
validators: see the {{GROUP_FLAG_VALIDATOR}} macro (a rough sketch of such 
checks follows below).

Performing the verification steps mentioned above should help avoid 
situations where Kudu tablet servers cannot register with the Kudu master 
because the location assignment executable path is misspelled or the 
executable does not behave as expected.
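
A rough sketch of the kinds of checks described above, with hypothetical 
helper names; a real implementation would hook into Kudu's 
{{GROUP_FLAG_VALIDATOR}} machinery, and the location format used here 
('/'-separated labels such as /dc0/rack09) is an assumption for illustration. 
POSIX stat()/access() are assumed available:

{code}
#include <regex>
#include <string>

#include <sys/stat.h>
#include <unistd.h>

// Check that the location assignment command is present and executable.
bool LocationCmdIsRunnable(const std::string& cmd_path) {
  struct stat st;
  if (stat(cmd_path.c_str(), &st) != 0) return false;  // present?
  if (!S_ISREG(st.st_mode)) return false;              // a regular file?
  return access(cmd_path.c_str(), X_OK) == 0;          // executable?
}

// Check that the command's stdout conforms to the assumed location format:
// one or more '/'-separated labels, e.g. "/dc0/rack09".
bool LocationHasValidFormat(const std::string& location) {
  static const std::regex kPattern(R"((/[A-Za-z0-9_\-.]+)+)");
  return std::regex_match(location, kPattern);
}
{code}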

> Hardening for location awareness command-line flag
> --
>
> Key: KUDU-2781
> URL: https://issues.apache.org/jira/browse/KUDU-2781
> Project: Kudu
>  Issue Type: Improvement
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-2781) Hardening for location awareness command-line flag

2020-06-03 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2781:

Labels: observability supportability  (was: )

> Hardening for location awareness command-line flag
> --
>
> Key: KUDU-2781
> URL: https://issues.apache.org/jira/browse/KUDU-2781
> Project: Kudu
>  Issue Type: Improvement
>  Components: master
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
>  Labels: observability, supportability
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KUDU-2169) Allow replicas that do not exist to vote

2020-06-02 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124314#comment-17124314
 ] 

Alexey Serbin edited comment on KUDU-2169 at 6/2/20, 9:08 PM:
--

Now we have the 3-4-3 replica management scheme, and we don't use the 3-2-3 
scheme anymore.

With the 3-4-3 scheme there are scenarios where the system first evicts a 
replica and then adds a new non-voter replica: that happens when the replica 
to be evicted falls behind the WAL segment GC threshold or experiences a disk 
failure.  In very rare cases it might happen that a tablet ends up with 
leader replica A not being able to replicate/commit the change in the Raft 
configuration, as described.

On the other hand, such a newly added replica D in the case of the 3-4-3 
scheme is a non-voter, and it cannot vote by definition.

In other words, some manual intervention would be necessary in the described 
scenario, but not in the way this JIRA proposes.

Closing as 'Won't Do'.



> Allow replicas that do not exist to vote
> 
>
> Key: KUDU-2169
> URL: https://issues.apache.org/jira/browse/KUDU-2169
> Project: Kudu
>  Issue Type: Sub-task
>  Components: consensus
>Reporter: Mike Percy
>Priority: Major
> Fix For: n/a
>
>
> In certain scenarios it is desirable for replicas that do not exist on a 
> tablet server to be able to vote. After the implementation of KUDU-871, 
> tombstoned tablets are now able to vote. However, there are circumstances (at 
> least in a pre- KUDU-1097 world) where voters that do not have a copy of a 
> replica (running or tombstoned) would be needed to vote to ensure 
> availability in certain edge-case failure scenarios.
> The quick justification for why it would be safe for a non-existent replica 
> to vote is that it would be equivalent to a replica that has simply not yet 
> replicated any WAL entries, in which case it would be legal to vote for any 
> candidate. Of course, a candidate would only ask such a replica to vote for 
> it if it believed that replica to be a voter in its config.
> Some additional discussion can be found here: 
> https://github.com/apache/kudu/blob/master/docs/design-docs/raft-tablet-copy.md#should-a-server-be-allowed-to-vote-if-it-does_not_exist-or-is-deleted
> What follows is an example of a scenario where "non-existent" replicas being 
> able to vote would be desired:
> In a 3-2-3 re-replication paradigm, the leader (A) of a 3-replica config \{A, 
> B, C\} evicts one replica (C). Then, the leader (A) adds a new voter (D). 
> Before A is able to replicate this config change to B or D, A is partitioned 
> from a network perspective. However A writes this config change to its local 
> WAL. After this, the entire cluster is brought down, the network is restored, 
> and the entire cluster is restarted. However, B fails to come back online due 
> to a hardware failure.
> The only way to automatically recover in this scenario is to allow D, which 
> has no concept of the tablet being discussed, to vote for A to become leader, 
> which will then tablet copy to D and make the tablet available for writes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KUDU-2169) Allow replicas that do not exist to vote

2020-06-02 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-2169.
-
Fix Version/s: n/a
   Resolution: Won't Do

Now we have the 3-4-3 replica management scheme, and we don't use the 3-2-3 
scheme anymore.

With the 3-4-3 scheme there are scenarios where the system first evicts a 
replica, and then adds a new non-voter replica: that's when the replica to be 
evicted falls behind the WAL segment GC threshold or experiences a disk 
failure.  In very rare cases it might happen that a tablet ends up with leader 
replica A, and replica A cannot replicate/commit the change in the Raft 
configuration as described.

On the other hand, such a newly added replica D in the 3-4-3 scheme is a 
non-voter, and it cannot vote by definition.

In other words, some manual intervention would be necessary in the described 
scenario, but not in the way this JIRA proposes.

Closing as 'Won't Do'.

> Allow replicas that do not exist to vote
> 
>
> Key: KUDU-2169
> URL: https://issues.apache.org/jira/browse/KUDU-2169
> Project: Kudu
>  Issue Type: Sub-task
>  Components: consensus
>Reporter: Mike Percy
>Priority: Major
> Fix For: n/a
>
>
> In certain scenarios it is desirable for replicas that do not exist on a 
> tablet server to be able to vote. After the implementation of KUDU-871, 
> tombstoned tablets are now able to vote. However, there are circumstances (at 
> least in a pre- KUDU-1097 world) where voters that do not have a copy of a 
> replica (running or tombstoned) would be needed to vote to ensure 
> availability in certain edge-case failure scenarios.
> The quick justification for why it would be safe for a non-existent replica 
> to vote is that it would be equivalent to a replica that has simply not yet 
> replicated any WAL entries, in which case it would be legal to vote for any 
> candidate. Of course, a candidate would only ask such a replica to vote for 
> it if it believed that replica to be a voter in its config.
> Some additional discussion can be found here: 
> https://github.com/apache/kudu/blob/master/docs/design-docs/raft-tablet-copy.md#should-a-server-be-allowed-to-vote-if-it-does_not_exist-or-is-deleted
> What follows is an example of a scenario where "non-existent" replicas being 
> able to vote would be desired:
> In a 3-2-3 re-replication paradigm, the leader (A) of a 3-replica config \{A, 
> B, C\} evicts one replica (C). Then, the leader (A) adds a new voter (D). 
> Before A is able to replicate this config change to B or D, A is partitioned 
> from a network perspective. However A writes this config change to its local 
> WAL. After this, the entire cluster is brought down, the network is restored, 
> and the entire cluster is restarted. However, B fails to come back online due 
> to a hardware failure.
> The only way to automatically recover in this scenario is to allow D, which 
> has no concept of the tablet being discussed, to vote for A to become leader, 
> which will then tablet copy to D and make the tablet available for writes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KUDU-1621) Flush lingering operations upon destruction of an AUTO_FLUSH_BACKGROUND session

2020-06-02 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-1621.
-
Fix Version/s: n/a
   Resolution: Won't Fix

Automatically flushing data in the {{KuduSession}} might block, indeed.  It 
seems the current approach of issuing a warning when data is not flushed is 
good enough: it's uniform across all flush modes and avoids the application 
hanging on close.

> Flush lingering operations upon destruction of an AUTO_FLUSH_BACKGROUND 
> session
> ---
>
> Key: KUDU-1621
> URL: https://issues.apache.org/jira/browse/KUDU-1621
> Project: Kudu
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 1.0.0
>Reporter: Alexey Serbin
>Priority: Major
> Fix For: n/a
>
>
> In the current implementation of AUTO_FLUSH_BACKGROUND mode, it's necessary 
> to call KuduSession::Flush() or KuduSession::FlushAsync() explicitly before 
> destroying/abandoning a session if it's desired to have any pending 
> operations flushed.
> As [~adar] noticed during review of https://gerrit.cloudera.org/#/c/4432/ , 
> it might make sense to change this behavior to automatically flush any 
> pending operations upon closing a Kudu AUTO_FLUSH_BACKGROUND session.  That 
> would be more consistent with the semantics of the AUTO_FLUSH_BACKGROUND 
> mode and more user-friendly.
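
For illustration, a minimal sketch of the pattern the current behavior 
requires from applications using the C++ client: flush explicitly before the 
session goes away.  The master address, table, and column names are made up; 
{{KUDU_CHECK_OK}} is used for brevity instead of real error handling:

{code}
// Minimal sketch of the explicit-flush pattern with AUTO_FLUSH_BACKGROUND.
#include <memory>

#include <kudu/client/client.h>

using kudu::client::KuduClient;
using kudu::client::KuduClientBuilder;
using kudu::client::KuduInsert;
using kudu::client::KuduSession;
using kudu::client::KuduTable;
using kudu::client::sp::shared_ptr;

int main() {
  shared_ptr<KuduClient> client;
  KUDU_CHECK_OK(KuduClientBuilder()
                    .add_master_server_addr("master.example.com:7051")
                    .Build(&client));

  shared_ptr<KuduTable> table;
  KUDU_CHECK_OK(client->OpenTable("my_table", &table));

  shared_ptr<KuduSession> session = client->NewSession();
  KUDU_CHECK_OK(session->SetFlushMode(KuduSession::AUTO_FLUSH_BACKGROUND));

  std::unique_ptr<KuduInsert> insert(table->NewInsert());
  KUDU_CHECK_OK(insert->mutable_row()->SetInt32("key", 42));
  // Apply() takes ownership of the operation.
  KUDU_CHECK_OK(session->Apply(insert.release()));

  // Without this explicit call, operations still buffered by the background
  // flusher would be dropped (with a warning) when the session is destroyed.
  KUDU_CHECK_OK(session->Flush());
  return 0;
}
{code}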



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3131) test rw_mutex-test hangs sometimes if build_type is release

2020-05-31 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17120608#comment-17120608
 ] 

Alexey Serbin commented on KUDU-3131:
-

Hi [~RenhaiZhao], at the server where I do a lot of compilation/testing, the 
version of glibc is {{2.12-1.149.el6_6.9}}.  It's a really old installation: 
CentOS 6.6.

> test rw_mutex-test hangs sometimes if build_type is release
> ---
>
> Key: KUDU-3131
> URL: https://issues.apache.org/jira/browse/KUDU-3131
> Project: Kudu
>  Issue Type: Sub-task
>Reporter: huangtianhua
>Priority: Major
>
> We built and tested Kudu on aarch64; in release mode there is a test that 
> hangs sometimes (maybe a deadlock?).  The console output is as follows:
> [==] Running 2 tests from 1 test case.
> [--] Global test environment set-up.
> [--] 2 tests from Priorities/RWMutexTest
> [ RUN  ] Priorities/RWMutexTest.TestDeadlocks/0
> It seems to be OK in debug mode.
> Now only this one test fails sometimes on aarch64.  [~aserbin] [~adar], 
> would you please have a look at this or give us some suggestions?  Thanks 
> very much.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KUDU-368) Run local benchmarks under perf-stat

2020-05-28 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reassigned KUDU-368:
--

Assignee: Alexey Serbin

> Run local benchmarks under perf-stat
> 
>
> Key: KUDU-368
> URL: https://issues.apache.org/jira/browse/KUDU-368
> Project: Kudu
>  Issue Type: Improvement
>  Components: test
>Affects Versions: M4.5
>Reporter: Todd Lipcon
>Assignee: Alexey Serbin
>Priority: Minor
>  Labels: benchmarks, perf
>
> It would be nice to run a lot of our nightly benchmarks under perf-stat so 
> we can see, on a regression, which factors changed (e.g. instruction count, 
> cycles, stalled cycles, cache misses, etc.)
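
For illustration, such a run might look like the following (the event names 
are standard Linux {{perf}} events; the benchmark binary and gtest filter are 
just placeholders):

{noformat}
perf stat -e cycles,instructions,stalled-cycles-frontend,cache-misses \
    ./build/release/bin/some-benchmark-test --gtest_filter='*Benchmark*'
{noformat}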



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-2604) Add label for tserver

2020-05-28 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118788#comment-17118788
 ] 

Alexey Serbin commented on KUDU-2604:
-

[~granthenke], yes, I think the remaining functionality can be broken down into 
smaller JIRA items.  At a higher level, I see the following pieces:
* Define and assign tags to tablet servers.
* Update the master's placement policies to take tags into account when 
adding/distributing replicas of tablets.
* Add support to the C++ and Java clients: clients can specify a set of tags 
when creating tables.
* The {{kudu cluster rebalance}} tool and the auto-rebalancer honor the tags 
when rebalancing the corresponding tables.  The tool is also able to report on 
tablet replicas which are placed in a non-conforming way w.r.t. the tags 
specified for tables (such non-conformantly placed replicas might appear during 
automatic re-replication: this is similar to what we have with the current 
placement policies).
* The {{kudu cluster ksck}} CLI tool provides information on tags for tablet 
servers.

We can create sub-tasks for these if we decide to implement this.

> Add label for tserver
> -
>
> Key: KUDU-2604
> URL: https://issues.apache.org/jira/browse/KUDU-2604
> Project: Kudu
>  Issue Type: New Feature
>Reporter: Hong Shen
>Priority: Major
>  Labels: location-awareness, rack-awareness
> Fix For: n/a
>
> Attachments: image-2018-10-15-21-52-21-426.png
>
>
> As the cluster grows bigger and bigger, a big table with a lot of tablets 
> will be distributed across almost all the tservers.  When a client writes 
> batches to the big table, it may cache connections to lots of tservers, so 
> scalability may be constrained.
> If the tablets of one table or partition live on only a subset of the 
> tservers, a client only has to cache connections to that subset.  So we 
> propose to add labels to tservers, with each tserver belonging to a unique 
> label.  The client specifies a label when creating a table or adding a 
> partition; the tablets will only be created on the tservers with the 
> specified label, and if no label is specified, the default label will be 
> used. 
>  It will also benefit:
> 1 Tservers across data centers.
> 2 Heterogeneous tservers, e.g. with different disks, CPU, or memory.
> 3 Physical isolation, especially of IO: isolating some tables from others.
> 4 Gated launches: upgrading tservers one label at a time.
> In our production cluster, we have encountered the above issues and they 
> need to be resolved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (KUDU-2604) Add label for tserver

2020-05-28 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reopened KUDU-2604:
-

It seems this JIRA item contains some useful ideas and details which are 
orthogonal to current implementation of the rack awareness feature.  If 
implemented, they might complement the overall functionality of the placement 
policies in Kudu.  I'm removing the duplicate of KUDU-1535 resolution.

> Add label for tserver
> -
>
> Key: KUDU-2604
> URL: https://issues.apache.org/jira/browse/KUDU-2604
> Project: Kudu
>  Issue Type: New Feature
>Reporter: Hong Shen
>Priority: Major
>  Labels: location-awareness, rack-awareness
> Fix For: n/a
>
> Attachments: image-2018-10-15-21-52-21-426.png
>
>
> As the cluster grows bigger and bigger, a big table with a lot of tablets 
> will be distributed across almost all the tservers.  When a client writes 
> batches to the big table, it may cache connections to lots of tservers, so 
> scalability may be constrained.
> If the tablets of one table or partition live on only a subset of the 
> tservers, a client only has to cache connections to that subset.  So we 
> propose to add labels to tservers, with each tserver belonging to a unique 
> label.  The client specifies a label when creating a table or adding a 
> partition; the tablets will only be created on the tservers with the 
> specified label, and if no label is specified, the default label will be 
> used. 
>  It will also benefit:
> 1 Tservers across data centers.
> 2 Heterogeneous tservers, e.g. with different disks, CPU, or memory.
> 3 Physical isolation, especially of IO: isolating some tables from others.
> 4 Gated launches: upgrading tservers one label at a time.
> In our production cluster, we have encountered the above issues and they 
> need to be resolved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-1865) Create fast path for RespondSuccess() in KRPC

2020-05-26 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117387#comment-17117387
 ] 

Alexey Serbin commented on KUDU-1865:
-

Some more stacks captured from diagnostic logs for the {{kudu-master}} process 
(Kudu 1.10):

{noformat}
Stacks at 0516 18:53:00.042003 (service queue overflowed for 
kudu.master.MasterService):
  tids=[736230]
  0x7f803a76a5e0 
0xb6219e tcmalloc::ThreadCache::ReleaseToCentralCache()
0xb62530 tcmalloc::ThreadCache::Scavenge()
0xad8a27 
kudu::master::CatalogManager::ScopedLeaderSharedLock::ScopedLeaderSharedLock()
0xaa3a31 kudu::master::MasterServiceImpl::GetTableSchema()
   0x221aaa9 kudu::rpc::GeneratedServiceIf::Handle()
   0x221b1a9 kudu::rpc::ServicePool::RunThread()
   0x23a8f84 kudu::Thread::SuperviseThread()
  0x7f803a762e25 start_thread
  0x7f8038a4134d __clone
  tids=[736248,736245,736243,736242]
  0x7f803a76a5e0 
   0x23c5b44 base::internal::SpinLockDelay()
   0x23c59cc base::SpinLock::SlowLock()
0xac5814 kudu::master::CatalogManager::CheckOnline()
0xae5032 kudu::master::CatalogManager::GetTableSchema()
0xaa3a85 kudu::master::MasterServiceImpl::GetTableSchema()
   0x221aaa9 kudu::rpc::GeneratedServiceIf::Handle()
   0x221b1a9 kudu::rpc::ServicePool::RunThread()
   0x23a8f84 kudu::Thread::SuperviseThread()
  0x7f803a762e25 start_thread
  0x7f8038a4134d __clone
  
tids=[736239,736229,736232,736233,736234,736235,736236,736237,736238,736240,736241,736244,736247]
  0x7f803a76a5e0 
   0x23c5b44 base::internal::SpinLockDelay()
   0x23c59cc base::SpinLock::SlowLock()
0xac5814 kudu::master::CatalogManager::CheckOnline()
0xaf102f kudu::master::CatalogManager::GetTableLocations()
0xaa36f8 kudu::master::MasterServiceImpl::GetTableLocations()
   0x221aaa9 kudu::rpc::GeneratedServiceIf::Handle()
   0x221b1a9 kudu::rpc::ServicePool::RunThread()
   0x23a8f84 kudu::Thread::SuperviseThread()
  0x7f803a762e25 start_thread
  0x7f8038a4134d __clone
  tids=[736246,736231]
  0x7f803a76a5e0 
   0x23c5b44 base::internal::SpinLockDelay()
   0x23c59cc base::SpinLock::SlowLock()
0xad8b7c 
kudu::master::CatalogManager::ScopedLeaderSharedLock::ScopedLeaderSharedLock()
0xaa369d kudu::master::MasterServiceImpl::GetTableLocations()
   0x221aaa9 kudu::rpc::GeneratedServiceIf::Handle()
   0x221b1a9 kudu::rpc::ServicePool::RunThread()
   0x23a8f84 kudu::Thread::SuperviseThread()
  0x7f803a762e25 start_thread
  0x7f8038a4134d __clone
{noformat}

> Create fast path for RespondSuccess() in KRPC
> -
>
> Key: KUDU-1865
> URL: https://issues.apache.org/jira/browse/KUDU-1865
> Project: Kudu
>  Issue Type: Improvement
>  Components: rpc
>Reporter: Sailesh Mukil
>Priority: Major
>  Labels: performance, rpc
> Attachments: alloc-pattern.py, cross-thread.txt
>
>
> A lot of RPCs just respond with RespondSuccess() which returns the exact 
> same payload every time. This takes the same path as any other response by 
> ultimately calling Connection::QueueResponseForCall() which makes a few small 
> allocations. These small allocations (and their corresponding deallocations) 
> happen quite frequently (once for every IncomingCall) and end up taking 
> quite some time in the kernel (traversing the free list, spin locks, etc.)
> This was found when [~mmokhtar] ran some profiles on Impala over KRPC on a 20 
> node cluster and found the following:
> The exact % of time spent is hard to quantify from the profiles, but these 
> were among the top 5 slowest stacks:
> {code:java}
> impalad ! tcmalloc::CentralFreeList::ReleaseToSpans - [unknown source file]
> impalad ! tcmalloc::CentralFreeList::ReleaseListToSpans + 0x1a - [unknown 
> source file]
> impalad ! tcmalloc::CentralFreeList::InsertRange + 0x3b - [unknown source 
> file]
> impalad ! tcmalloc::ThreadCache::ReleaseToCentralCache + 0x103 - [unknown 
> source file]
> impalad ! tcmalloc::ThreadCache::Scavenge + 0x3e - [unknown source file]
> impalad ! operator delete + 0x329 - [unknown source file]
> impalad ! __gnu_cxx::new_allocator::deallocate + 0x4 - 
> new_allocator.h:110
> impalad ! std::_Vector_base std::allocator>::_M_deallocate + 0x5 - stl_vector.h:178
> impalad ! ~_Vector_base + 0x4 - stl_vector.h:160
> impalad ! ~vector - stl_vector.h:425    'slices' vector
> impalad ! kudu::rpc::Connection::QueueResponseForCall + 0xac - 
> connection.cc:433
> impalad ! kudu::rpc::InboundCall::Respond + 0xfa - inbound_call.cc:133
> impalad ! kud

[jira] [Commented] (KUDU-3131) test rw_mutex-test hangs sometimes if build_type is release

2020-05-26 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117008#comment-17117008
 ] 

Alexey Serbin commented on KUDU-3131:
-

I cannot reproduce this on the x86_64 architecture, and I don't have access to 
aarch64 at this point.

I'd try to attach to the hung process with a debugger and see what's going on.
[~huangtianhua], did you have a chance to try that?
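
For reference, attaching to the hung test and dumping all thread stacks with 
gdb might look like this (the pid is a placeholder):

{noformat}
gdb -p <pid-of-hung-rw_mutex-test>
(gdb) thread apply all bt
(gdb) detach
{noformat}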

> test rw_mutex-test hangs sometimes if build_type is release
> ---
>
> Key: KUDU-3131
> URL: https://issues.apache.org/jira/browse/KUDU-3131
> Project: Kudu
>  Issue Type: Sub-task
>Reporter: huangtianhua
>Priority: Major
>
> We built and tested Kudu on aarch64; in release mode there is a test that 
> hangs sometimes (maybe a deadlock?).  The console output is as follows:
> [==] Running 2 tests from 1 test case.
> [--] Global test environment set-up.
> [--] 2 tests from Priorities/RWMutexTest
> [ RUN  ] Priorities/RWMutexTest.TestDeadlocks/0
> It seems to be OK in debug mode.
> Now only this one test fails sometimes on aarch64.  [~aserbin] [~adar], 
> would you please have a look at this or give us some suggestions?  Thanks 
> very much.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KUDU-3107) TestRpc.TestCancellationMultiThreads fail on ARM sometimes due to service queue is full

2020-05-20 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-3107.
-
Fix Version/s: NA
   Resolution: Cannot Reproduce

> TestRpc.TestCancellationMultiThreads fail on ARM sometimes due to service 
> queue is full
> ---
>
> Key: KUDU-3107
> URL: https://issues.apache.org/jira/browse/KUDU-3107
> Project: Kudu
>  Issue Type: Sub-task
>Reporter: liusheng
>Priority: Major
> Fix For: NA
>
> Attachments: rpc-test.txt
>
>
> The test TestRpc.TestCancellationMultiThreads fails sometimes on an ARM 
> machine due to the "service queue full" error. Related error message:
> {code:java}
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 318)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 319)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 320)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 321)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 324)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 332)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 334)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 335)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 336)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 337)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 338)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 339)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 340)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 341)
> F0416 13:01:38.616358 31937 rpc-test.cc:1471] Check failed: 
> controller.status().IsAborted() || controller.status().IsServiceUnavailable() 
> || controller.status().ok() Remote error: Service unavailable: PushStrings 
> request on kudu.rpc.GenericCalculatorService from 127.0.0.1:41516 dropped due 
> to backpressure. The service queue is full; it has 100 items.
> *** Check failure stack trace: ***
> PC: @0x0 (unknown)
> *** SIGABRT (@0x3e86bbf) received by PID 27583 (TID 0x84b1f050) from 
> PID 27583; stack trace: ***
> @ 0x93cf0464 raise at ??:0
> @ 0x93cf18b4 abort at ??:0
> @ 0x942c5fdc google::logging_fail() at ??:0
> @ 0x942c7d40 google::LogMessage::Fail() at ??:0
> @ 0x942c9c78 google::LogMessage::SendToLog() at ??:0
> @ 0x942c7874 google::LogMessage::Flush() at ??:0
> @ 0x942ca4fc google::LogMessageFatal::~LogMessageFatal() at ??:0
> @ 0xdcee4940 kudu::rpc::SendAndCancelRpcs() at ??:0
> @ 0xdcee4b98 
> _ZZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvENKUlvE_clEv
>  at ??:0
> @ 0xdcee76bc 
> _ZSt13__invoke_implIvZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_JEET_St14__invoke_otherOT0_DpOT1_
>  at ??:0
> @ 0xdcee7484 
> _ZSt8__invokeIZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS5_DpOS6_
>  at ??:0
> @ 0xdcee8208 
> _ZNSt6thread8_InvokerISt5tupleIJZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_EEE9_M_invokeIJLm0DTcl8__invokespcl10_S_declvalIXT_ESt12_Index_tupleIJXspT_EEE
>  at ??:0
> @ 0xdcee8168 
> _ZNSt6thread8_InvokerISt5tupleIJZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_EEEclEv
>  at ??:0
> @ 0xdcee8110 
> _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_E6_M_runEv
>  at ??:0
> @ 0x93f22e94 (unknown) at ??:0
> @ 0x93e1e088 start_thread at ??:0
> @ 0x93d8e4ec (unknown) at ??:0
> {code}
> The attachment is the full test log



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-2453) kudu should stop creating tablet infinitely

2020-05-18 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110649#comment-17110649
 ] 

Alexey Serbin commented on KUDU-2453:
-

There is a reproduction scenario for the issue described in this JIRA: 
https://gerrit.cloudera.org/#/c/15912/

> kudu should stop creating tablet infinitely
> ---
>
> Key: KUDU-2453
> URL: https://issues.apache.org/jira/browse/KUDU-2453
> Project: Kudu
>  Issue Type: Bug
>  Components: master, tserver
>Affects Versions: 1.4.0, 1.7.2
>Reporter: LiFu He
>Priority: Major
>
> I hit this problem again on 2018/10/26, and now the Kudu version is 
> 1.7.2.
> -
> We modified the flag 'max_create_tablets_per_ts' (set to 2000) in 
> master.conf, and there was some load on the Kudu cluster. Then someone else 
> created a big table which had tens of thousands of tablets from impala-shell 
> (that was a mistake). 
> {code:java}
> CREATE TABLE XXX(
> ...
>PRIMARY KEY (...)
> )
> PARTITION BY HASH (...) PARTITIONS 100,
> RANGE (...)
> (
>   PARTITION "2018-10-24" <= VALUES < "2018-10-24\000",
>   PARTITION "2018-10-25" <= VALUES < "2018-10-25\000",
>   ...
>   PARTITION "2018-12-07" <= VALUES < "2018-12-07\000"
> )
> STORED AS KUDU
> TBLPROPERTIES ('kudu.master_addresses'= '...');
> {code}
> Here are the logs after creating table (only pick one tablet as example):
> {code:java}
> --Kudu-master log
> ==e884bda6bbd3482f94c07ca0f34f99a4==
> W1024 11:40:51.914397 180146 catalog_manager.cc:2664] TS 
> 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): Create Tablet RPC 
> failed for tablet e884bda6bbd3482f94c07ca0f34f99a4: Remote error: Service 
> unavailable: CreateTablet request on kudu.tserver.TabletServerAdminService 
> from 10.120.219.118:50247 dropped due to backpressure. The service queue is 
> full; it has 512 items.
> I1024 11:40:51.914412 180146 catalog_manager.cc:2700] Scheduling retry of 
> CreateTablet RPC for tablet e884bda6bbd3482f94c07ca0f34f99a4 on TS 
> 39f15fcf42ef45bba0c95a3223dc25ee with a delay of 42 ms (attempt = 1)
> ...
> ==Be replaced by 0b144c00f35d48cca4d4981698faef72==
> W1024 11:41:22.114512 180202 catalog_manager.cc:3949] T 
>  P f6c9a09da7ef4fc191cab6276b942ba3: Tablet 
> e884bda6bbd3482f94c07ca0f34f99a4 (table quasi_realtime_user_feature 
> [id=946d6dd03ec544eab96231e5a03bed59]) was not created within the allowed 
> timeout. Replacing with a new tablet 0b144c00f35d48cca4d4981698faef72
> ...
> I1024 11:41:22.391916 180202 catalog_manager.cc:3806] T 
>  P f6c9a09da7ef4fc191cab6276b942ba3: Sending 
> DeleteTablet for 3 replicas of tablet e884bda6bbd3482f94c07ca0f34f99a4
> ...
> I1024 11:41:22.391927 180202 catalog_manager.cc:2922] Sending 
> DeleteTablet(TABLET_DATA_DELETED) for tablet e884bda6bbd3482f94c07ca0f34f99a4 
> on 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050) (Replaced by 
> 0b144c00f35d48cca4d4981698faef72 at 2018-10-24 11:41:22 CST)
> ...
> W1024 11:41:22.428129 180146 catalog_manager.cc:2892] TS 
> 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): delete failed for 
> tablet e884bda6bbd3482f94c07ca0f34f99a4 with error code TABLET_NOT_RUNNING: 
> Already present: State transition of tablet e884bda6bbd3482f94c07ca0f34f99a4 
> already in progress: creating tablet
> ...
> I1024 11:41:22.428143 180146 catalog_manager.cc:2700] Scheduling retry of 
> e884bda6bbd3482f94c07ca0f34f99a4 Delete Tablet RPC for 
> TS=39f15fcf42ef45bba0c95a3223dc25ee with a delay of 35 ms (attempt = 1)
> ...
> W1024 11:41:22.683702 180145 catalog_manager.cc:2664] TS 
> b251540e606b4863bb576091ff961892 (kudu1.lt.163.org:7050): Create Tablet RPC 
> failed for tablet 0b144c00f35d48cca4d4981698faef72: Remote error: Service 
> unavailable: CreateTablet request on kudu.tserver.TabletServerAdminService 
> from 10.120.219.118:59735 dropped due to backpressure. The service queue is 
> full; it has 512 items.
> I1024 11:41:22.683717 180145 catalog_manager.cc:2700] Scheduling retry of 
> CreateTablet RPC for tablet 0b144c00f35d48cca4d4981698faef72 on TS 
> b251540e606b4863bb576091ff961892 with a delay of 46 ms (attempt = 1)
> ...
> ==Be replaced by c0e0acc448fc42fc9e48f5025b112a75==
> W1024 11:41:52.775420 180202 catalog_manager.cc:3949] T 
>  P f6c9a09da7ef4fc191cab6276b942ba3: Tablet 
> 0b144c00f35d48cca4d4981698faef72 (table quasi_realtime_user_feature 
> [id=946d6dd03ec544eab96231e5a03bed59]) was not created within the allowed 
> timeout. Replacing with a new tablet c0e0acc448fc42fc9e48f5025b112a75
> ...
> --Kudu-tserver log
> I1024 11:40:52.014571 13

[jira] [Created] (KUDU-3124) A safer way to handle CreateTablet requests

2020-05-18 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3124:
---

 Summary: A safer way to handle CreateTablet requests
 Key: KUDU-3124
 URL: https://issues.apache.org/jira/browse/KUDU-3124
 Project: Kudu
  Issue Type: Improvement
  Components: master, tserver
Affects Versions: 1.11.1, 1.11.0, 1.10.1, 1.10.0, 1.9.0, 1.7.1, 1.8.0, 
1.7.0, 1.6.0, 1.5.0, 1.4.0, 1.3.1, 1.3.0, 1.2.0
Reporter: Alexey Serbin


As of now, the catalog manager (a part of kudu-master) sends 
{{CreateTabletRequest}} RPCs
as soon as they are generated by {{CatalogManager::ProcessPendingAssignments()}}
when processing the list of deferred DDL operations, and at this level there
aren't any restrictions on how many of those might be in flight or sent to
a particular tablet server (NOTE: there is the {{\-\-max_create_tablets_per_ts}}
flag,
but it works at a higher level and only during the initial creation of a table).

The {{CreateTablet}} requests are sent asynchronously, and if the tablet isn't
created within {{\-\-tablet_creation_timeout_ms}} milliseconds, the catalog
manager replaces all the tablet replicas, generating a new tablet UUID and
sending the corresponding {{CreateTabletRequest}} RPCs to a potentially
different set of tablet servers.  Corresponding {{DeleteTabletRequest}} RPCs
(to remove the replicas of the stalled-during-creation tablet) are sent
separately in an asynchronous way as well.

There are at least two issues with this approach:

# The {{\-\-max_create_tablets_per_ts}} threshold limits the number of 
concurrent requests hitting one tablet server only during the initial creation 
of a table. However, nothing limits how many requests to create a tablet 
replica might hit a tablet server when adding partitions to an existing table 
as a result of an ALTER TABLE request.
# {{DeleteTabletRequest}} RPCs sometimes might not get into the RPC queues of
the corresponding tablet servers, and the catalog manager stops retrying them
after the {{\-\-unresponsive_ts_rpc_timeout_ms}} interval.  This might spiral 
into a situation where requests to create replacement tablet replicas pass 
through and are executed by tablet servers, but the corresponding requests to 
delete tablet replicas cannot get through because of queue overflows, with the 
catalog manager eventually giving up retrying the latter ones.  Eventually, 
tablet servers end up with a huge number of tablet replicas created, and they 
crash after running out of memory.  The crashed tablet servers cannot start 
after that because they run out of memory again while trying to bootstrap the 
huge number of tablet replicas.  See 
https://gerrit.cloudera.org/#/c/15912/ for the reproduction scenario and 
[KUDU-2453|https://issues.apache.org/jira/browse/KUDU-2453] for the 
corresponding issue reported some time ago. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3000) RemoteKsckTest.TestChecksumSnapshot sometimes fails

2020-05-11 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3000:

Attachment: ksck_remote-test.01.txt.xz

> RemoteKsckTest.TestChecksumSnapshot sometimes fails
> ---
>
> Key: KUDU-3000
> URL: https://issues.apache.org/jira/browse/KUDU-3000
> Project: Kudu
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.10.0, 1.10.1, 1.11.0
>Reporter: Alexey Serbin
>Priority: Major
> Attachments: ksck_remote-test.01.txt.xz, ksck_remote-test.txt.xz
>
>
> The {{TestChecksumSnapshot}} scenario of the {{RemoteKsckTest}} test 
> sometimes fails with the following error message:
> {noformat}
> W1116 06:46:18.593114  3904 tablet_service.cc:2365] Rejecting scan request 
> for tablet 4ce9988aac744b
> 1bbde2772c66cce35d: Uninitialized: safe time has not yet been initialized
> src/kudu/tools/ksck_remote-test.cc:407: Failure
> Failed
> {noformat}
> Full log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3000) RemoteKsckTest.TestChecksumSnapshot sometimes fails

2020-05-11 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17105098#comment-17105098
 ] 

Alexey Serbin commented on KUDU-3000:
-

Another failure (probably with a different root cause this time):

{noformat}
/data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/ksck_remote-test.cc:407:
 Failure
Failed
Bad status: Aborted: 1 errors were detected
{noformat}

The log is attached. [^ksck_remote-test.01.txt.xz] 

> RemoteKsckTest.TestChecksumSnapshot sometimes fails
> ---
>
> Key: KUDU-3000
> URL: https://issues.apache.org/jira/browse/KUDU-3000
> Project: Kudu
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.10.0, 1.10.1, 1.11.0
>Reporter: Alexey Serbin
>Priority: Major
> Attachments: ksck_remote-test.01.txt.xz, ksck_remote-test.txt.xz
>
>
> The {{TestChecksumSnapshot}} scenario of the {{RemoteKsckTest}} test 
> sometimes fails with the following error message:
> {noformat}
> W1116 06:46:18.593114  3904 tablet_service.cc:2365] Rejecting scan request 
> for tablet 4ce9988aac744b
> 1bbde2772c66cce35d: Uninitialized: safe time has not yet been initialized
> src/kudu/tools/ksck_remote-test.cc:407: Failure
> Failed
> {noformat}
> Full log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3120) testHiveMetastoreIntegration(org.apache.kudu.test.TestMiniKuduCluster) sometimes fails with timeout

2020-05-11 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3120:
---

 Summary: 
testHiveMetastoreIntegration(org.apache.kudu.test.TestMiniKuduCluster) 
sometimes fails with timeout
 Key: KUDU-3120
 URL: https://issues.apache.org/jira/browse/KUDU-3120
 Project: Kudu
  Issue Type: Bug
  Components: test
Reporter: Alexey Serbin
 Attachments: test-output.txt.xz

The subject test sometimes fails due to a timeout:

{noformat}
Time: 56.114
There was 1 failure:
1) testHiveMetastoreIntegration(org.apache.kudu.test.TestMiniKuduCluster)
org.junit.runners.model.TestTimedOutException: test timed out after 5 
milliseconds
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:255)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at 
org.apache.kudu.test.cluster.MiniKuduCluster.sendRequestToCluster(MiniKuduCluster.java:162)
at 
org.apache.kudu.test.cluster.MiniKuduCluster.start(MiniKuduCluster.java:235)
at 
org.apache.kudu.test.cluster.MiniKuduCluster.access$300(MiniKuduCluster.java:72)
at 
org.apache.kudu.test.cluster.MiniKuduCluster$MiniKuduClusterBuilder.build(MiniKuduCluster.java:697)
at 
org.apache.kudu.test.TestMiniKuduCluster.testHiveMetastoreIntegration(TestMiniKuduCluster.java:106)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748)
{noformat}

The log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3119) ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN

2020-05-11 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3119:
---

 Summary: ToolTest.TestFsAddRemoveDataDirEndToEnd reports race 
under TSAN
 Key: KUDU-3119
 URL: https://issues.apache.org/jira/browse/KUDU-3119
 Project: Kudu
  Issue Type: Bug
  Components: CLI, test
Reporter: Alexey Serbin
 Attachments: kudu-tool-test.log.xz

Sometimes the {{TestFsAddRemoveDataDirEndToEnd}} scenario of the {{ToolTest}} 
reports races for TSAN builds:

{noformat}
/data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:266:
 Failure
Failed
Bad status: Runtime error: /tmp/dist-test-taskIZqSmU/build/tsan/bin/kudu: 
process exited with non-zero status 66
Google Test trace:
/data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:265:
 W0506 17:56:02.744191  4432 flags.cc:404] Enabled unsafe flag: --never_fsync=true
I0506 17:56:02.780252  4432 fs_manager.cc:263] Metadata directory not provided
I0506 17:56:02.780442  4432 fs_manager.cc:269] Using write-ahead log directory 
(fs_wal_dir) as metadata directory
I0506 17:56:02.789638  4432 fs_manager.cc:399] Time spent opening directory 
manager: real 0.007s user 0.005s sys 0.002s
I0506 17:56:02.789986  4432 env_posix.cc:1676] Not raising this process' open 
files per process limit of 1048576; it is already as high as it can go
I0506 17:56:02.790426  4432 file_cache.cc:465] Constructed file cache lbm with 
capacity 419430
==
WARNING: ThreadSanitizer: data race (pid=4432)
...
{noformat}

The log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3117) TServerQuiescingITest.TestMajorityQuiescingElectsLeader sometimes fails

2020-05-11 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3117:

Description: 
The {{TServerQuiescingITest.TestMajorityQuiescingElectsLeader}} test scenario 
sometimes fails (TSAN builds) with messages like the ones below:

{noformat}
kudu/integration-tests/tablet_server_quiescing-itest.cc:228: Failure
Failed  
Bad status: Timed out: Unable to find leader of tablet 
65dba49fff3f45678638414af2fe83e4 after 10.005s. Status message: Not found: 
Error connecting to replica: Timed out: GetConsensusState RPC to 
127.0.177.65:42397 timed out after -0.003s (SENT)
kudu/util/test_util.cc:349: Failure
Failed
{noformat}

The log is attached.

  was:
The {{TServerQuiescingITest.TestMajorityQuiescingElectsLeader}} test scenario 
sometimes fails (TSAN builds) with messages like the ones below:

{noformat}
kudu/integration-tests/tablet_server_quiescing-itest.cc:228: Failure
Failed  
Bad status: Timed out: Unable to find leader of tablet 
65dba49fff3f45678638414af2fe83e4 after 10.005s. Status message: Not found: 
Error connecting to replica: Timed out: GetConsensusState RPC to 
127.0.177.65:42397 timed out after -0.003s (SENT)
kudu/util/test_util.cc:349: Failure
Failed
{noformat}


> TServerQuiescingITest.TestMajorityQuiescingElectsLeader sometimes fails
> ---
>
> Key: KUDU-3117
> URL: https://issues.apache.org/jira/browse/KUDU-3117
> Project: Kudu
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.12.0
>Reporter: Alexey Serbin
>Priority: Minor
> Attachments: tablet_server_quiescing-itest.txt.xz
>
>
> The {{TServerQuiescingITest.TestMajorityQuiescingElectsLeader}} test scenario 
> sometimes fails (TSAN builds) with messages like the ones below:
> {noformat}
> kudu/integration-tests/tablet_server_quiescing-itest.cc:228: Failure
> Failed
>   
> Bad status: Timed out: Unable to find leader of tablet 
> 65dba49fff3f45678638414af2fe83e4 after 10.005s. Status message: Not found: 
> Error connecting to replica: Timed out: GetConsensusState RPC to 
> 127.0.177.65:42397 timed out after -0.003s (SENT)
> kudu/util/test_util.cc:349: Failure
> Failed
> {noformat}
> The log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3117) TServerQuiescingITest.TestMajorityQuiescingElectsLeader sometimes fails

2020-05-11 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3117:

Attachment: tablet_server_quiescing-itest.txt.xz

> TServerQuiescingITest.TestMajorityQuiescingElectsLeader sometimes fails
> ---
>
> Key: KUDU-3117
> URL: https://issues.apache.org/jira/browse/KUDU-3117
> Project: Kudu
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.12.0
>Reporter: Alexey Serbin
>Priority: Minor
> Attachments: tablet_server_quiescing-itest.txt.xz
>
>
> The {{TServerQuiescingITest.TestMajorityQuiescingElectsLeader}} test scenario 
> sometimes fails (TSAN builds) with messages like the ones below:
> {noformat}
> kudu/integration-tests/tablet_server_quiescing-itest.cc:228: Failure
> Failed
>   
> Bad status: Timed out: Unable to find leader of tablet 
> 65dba49fff3f45678638414af2fe83e4 after 10.005s. Status message: Not found: 
> Error connecting to replica: Timed out: GetConsensusState RPC to 
> 127.0.177.65:42397 timed out after -0.003s (SENT)
> kudu/util/test_util.cc:349: Failure
> Failed
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3117) TServerQuiescingITest.TestMajorityQuiescingElectsLeader sometimes fails

2020-05-11 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3117:
---

 Summary: TServerQuiescingITest.TestMajorityQuiescingElectsLeader 
sometimes fails
 Key: KUDU-3117
 URL: https://issues.apache.org/jira/browse/KUDU-3117
 Project: Kudu
  Issue Type: Bug
  Components: test
Affects Versions: 1.12.0
Reporter: Alexey Serbin


The {{TServerQuiescingITest.TestMajorityQuiescingElectsLeader}} test scenario 
sometimes fails (TSAN builds) with messages like the ones below:

{noformat}
kudu/integration-tests/tablet_server_quiescing-itest.cc:228: Failure
Failed  
Bad status: Timed out: Unable to find leader of tablet 
65dba49fff3f45678638414af2fe83e4 after 10.005s. Status message: Not found: 
Error connecting to replica: Timed out: GetConsensusState RPC to 
127.0.177.65:42397 timed out after -0.003s (SENT)
kudu/util/test_util.cc:349: Failure
Failed
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3115) Improve scalability of Kudu masters

2020-05-04 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3115:
---

 Summary: Improve scalability of Kudu masters
 Key: KUDU-3115
 URL: https://issues.apache.org/jira/browse/KUDU-3115
 Project: Kudu
  Issue Type: Improvement
Reporter: Alexey Serbin


Currently, multiple masters in a multi-master Kudu cluster are used only for 
high availability & fault tolerance use cases, but not for sharing the load 
among the available master nodes.  For example, Kudu clients detect the current 
leader master upon connecting to the cluster and send all their subsequent 
requests to the leader master, so serving many more clients requires running 
masters on more powerful nodes.  The current design assumes that masters store 
and process requests for metadata only, but that makes sense only up to some 
limit on the rate of incoming client requests.

It would be great to achieve better 'horizontal' scalability for Kudu masters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3114) tserver writes core dump when reporting 'out of space'

2020-05-04 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099242#comment-17099242
 ] 

Alexey Serbin commented on KUDU-3114:
-

Right, it's possible to disable coredumps for Kudu processes by adding 
{{\-\-disable_core_dumps}} even if the limit for core file size is set to 
non-zero.  My point was that enabling/disabling coredumps per {{LOG(FATAL)}} 
instance is not feasible.

Dumping a core file might make sense when troubleshooting an issue: e.g., if 
there is a bug in computing the number of bytes to allocate, what event 
triggered the issue if an unexpectedly high amount of space is requested, 
etc.  Probably, we can keep that for DEBUG builds only.

I'm OK with keeping this JIRA item open (so, I'm re-opening it).  Feel free to 
submit a patch to address the issue as needed.

> tserver writes core dump when reporting 'out of space'
> --
>
> Key: KUDU-3114
> URL: https://issues.apache.org/jira/browse/KUDU-3114
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.7.1
>Reporter: Balazs Jeszenszky
>Priority: Major
> Fix For: n/a
>
>
> Fatal log has:
> {code}
> F0503 23:56:27.359544 40012 status_callback.cc:35] Enqueued commit operation 
> failed to write to WAL: IO error: Insufficient disk space to allocate 8388608 
> bytes under path  (39973171200 bytes available vs 39988335247 bytes 
> reserved) (error 28)
> {code}
> Generating a core file in this case yields no benefit, and potentially 
> compounds the problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (KUDU-3114) tserver writes core dump when reporting 'out of space'

2020-05-04 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reopened KUDU-3114:
-

> tserver writes core dump when reporting 'out of space'
> --
>
> Key: KUDU-3114
> URL: https://issues.apache.org/jira/browse/KUDU-3114
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.7.1
>Reporter: Balazs Jeszenszky
>Priority: Major
> Fix For: n/a
>
>
> Fatal log has:
> {code}
> F0503 23:56:27.359544 40012 status_callback.cc:35] Enqueued commit operation 
> failed to write to WAL: IO error: Insufficient disk space to allocate 8388608 
> bytes under path  (39973171200 bytes available vs 39988335247 bytes 
> reserved) (error 28)
> {code}
> Generating a core file in this case yields no benefit, and potentially 
> compounds the problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KUDU-3114) tserver writes core dump when reporting 'out of space'

2020-05-04 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-3114.
-
Fix Version/s: n/a
   Resolution: Information Provided

> tserver writes core dump when reporting 'out of space'
> --
>
> Key: KUDU-3114
> URL: https://issues.apache.org/jira/browse/KUDU-3114
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.7.1
>Reporter: Balazs Jeszenszky
>Priority: Major
> Fix For: n/a
>
>
> Fatal log has:
> {code}
> F0503 23:56:27.359544 40012 status_callback.cc:35] Enqueued commit operation 
> failed to write to WAL: IO error: Insufficient disk space to allocate 8388608 
> bytes under path  (39973171200 bytes available vs 39988335247 bytes 
> reserved) (error 28)
> {code}
> Generating a core file in this case yields no benefit, and potentially 
> compounds the problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3114) tserver writes core dump when reporting 'out of space'

2020-05-04 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099075#comment-17099075
 ] 

Alexey Serbin commented on KUDU-3114:
-

Thank you for reporting the issue.

The way fatal inconsistencies are handled in Kudu doesn't provide a way to 
choose the coredump behavior per instance.  That behavior is controlled at a 
different level: the environment that Kudu processes are run with (check 
{{ulimit -c}}).

As a good operational practice, it's advised to separate the location for core 
files (some directory on the system partition/volume?) from the directories 
where Kudu stores its data and WAL.  Also, consider [enabling mini-dumps in 
Kudu|https://kudu.apache.org/docs/troubleshooting.html#crash_reporting] and 
disabling core files if dumping cores isn't feasible due to space limitations.
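
For example, a tablet server might be run with mini-dumps enabled and core 
files disabled (flag names as in the linked documentation; the minidump path 
and the elided flags are just placeholders):

{noformat}
kudu-tserver --enable_minidumps --minidump_path=/var/log/kudu/minidumps \
    --disable_core_dumps <other flags...>
{noformat}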

> tserver writes core dump when reporting 'out of space'
> --
>
> Key: KUDU-3114
> URL: https://issues.apache.org/jira/browse/KUDU-3114
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.7.1
>Reporter: Balazs Jeszenszky
>Priority: Major
>
> Fatal log has:
> {code}
> F0503 23:56:27.359544 40012 status_callback.cc:35] Enqueued commit operation 
> failed to write to WAL: IO error: Insufficient disk space to allocate 8388608 
> bytes under path  (39973171200 bytes available vs 39988335247 bytes 
> reserved) (error 28)
> {code}
> Generating a core file in this case yields no benefit, and potentially 
> compounds the problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3107) TestRpc.TestCancellationMultiThreads fail on ARM sometimes due to service queue is full

2020-04-30 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096864#comment-17096864
 ] 

Alexey Serbin commented on KUDU-3107:
-

I think the problem is that the code doesn't properly convert the RPC-level 
status code into the application status code.  I think the following is 
missing:

{noformat}
if (controller.status().IsRemoteError()) {
  const ErrorStatusPB* err = rpc->error_response();
  CHECK(err && err->has_code() &&
        (err->code() == ErrorStatusPB::ERROR_SERVER_TOO_BUSY ||
         err->code() == ErrorStatusPB::ERROR_UNAVAILABLE));
}
{noformat}

> TestRpc.TestCancellationMultiThreads fail on ARM sometimes due to service 
> queue is full
> ---
>
> Key: KUDU-3107
> URL: https://issues.apache.org/jira/browse/KUDU-3107
> Project: Kudu
>  Issue Type: Sub-task
>Reporter: liusheng
>Priority: Major
> Attachments: rpc-test.txt
>
>
> The test TestRpc.TestCancellationMultiThreads fails sometimes on an ARM 
> machine due to the "service queue full" error. Related error message:
> {code:java}
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 318)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 319)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 320)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 321)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 324)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 332)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 334)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 335)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 336)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 337)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 338)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 339)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 340)
> Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 
> (request call id 341)
> F0416 13:01:38.616358 31937 rpc-test.cc:1471] Check failed: 
> controller.status().IsAborted() || controller.status().IsServiceUnavailable() 
> || controller.status().ok() Remote error: Service unavailable: PushStrings 
> request on kudu.rpc.GenericCalculatorService from 127.0.0.1:41516 dropped due 
> to backpressure. The service queue is full; it has 100 items.
> *** Check failure stack trace: ***
> PC: @0x0 (unknown)
> *** SIGABRT (@0x3e86bbf) received by PID 27583 (TID 0x84b1f050) from 
> PID 27583; stack trace: ***
> @ 0x93cf0464 raise at ??:0
> @ 0x93cf18b4 abort at ??:0
> @ 0x942c5fdc google::logging_fail() at ??:0
> @ 0x942c7d40 google::LogMessage::Fail() at ??:0
> @ 0x942c9c78 google::LogMessage::SendToLog() at ??:0
> @ 0x942c7874 google::LogMessage::Flush() at ??:0
> @ 0x942ca4fc google::LogMessageFatal::~LogMessageFatal() at ??:0
> @ 0xdcee4940 kudu::rpc::SendAndCancelRpcs() at ??:0
> @ 0xdcee4b98 
> _ZZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvENKUlvE_clEv
>  at ??:0
> @ 0xdcee76bc 
> _ZSt13__invoke_implIvZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_JEET_St14__invoke_otherOT0_DpOT1_
>  at ??:0
> @ 0xdcee7484 
> _ZSt8__invokeIZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS5_DpOS6_
>  at ??:0
> @ 0xdcee8208 
> _ZNSt6thread8_InvokerISt5tupleIJZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_EEE9_M_invokeIJLm0DTcl8__invokespcl10_S_declvalIXT_ESt12_Index_tupleIJXspT_EEE
>  at ??:0
> @ 0xdcee8168 
> _ZNSt6thread8_InvokerISt5tupleIJZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_EEEclEv
>  at ??:0
> @ 0xdcee8110 
> _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_E6_M_runEv
>  at ??:0
> @ 0x93f22e94 (unknown) at ??:0
> @ 0x93e1e088 start_thread at ??:0
> @ 0x93d8e4ec (unknown) at ??:0
> {code}
> The attachment is the full test log



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle

2020-04-24 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3106:

Summary: getEndpointChannelBindings() isn't working as expected with 
BouncyCastle  (was: getEndpointChannelBindings() isn't working as expected with 
BouncyCastle 1.65)

> getEndpointChannelBindings() isn't working as expected with BouncyCastle
> 
>
> Key: KUDU-3106
> URL: https://issues.apache.org/jira/browse/KUDU-3106
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java, security
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, 
> 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
> Fix For: 1.12.0
>
>
> With [BouncyCastle|https://www.bouncycastle.org] 1.65 the code in 
> https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159
>  isn't working as expected throwing an exception:
> {noformat}
> java.lang.RuntimeException: cert uses unknown signature algorithm: 
> SHA256WITHRSA
> {noformat}
> It seems BouncyCastle 1.65 converts the name of the certificate signature 
> algorithm to uppercase.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3111) IWYU processes freestanding headers

2020-04-24 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3111:
---

 Summary: IWYU processes freestanding headers 
 Key: KUDU-3111
 URL: https://issues.apache.org/jira/browse/KUDU-3111
 Project: Kudu
  Issue Type: Improvement
Affects Versions: 1.11.1, 1.11.0, 1.10.1, 1.10.0, 1.9.0, 1.8.0, 1.7.0, 
1.12.0
Reporter: Alexey Serbin


When working out of the compilation database, IWYU processes only associated 
headers, i.e. {{.h}} files that pair with corresponding {{.cc}} files.  It 
would be nice to make IWYU process so-called freestanding header files.  [This 
thread|https://github.com/include-what-you-use/include-what-you-use/issues/268] 
contains very useful information on the topic.
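
For context, a typical run over the compilation database looks something like 
the following (paths are illustrative).  A freestanding header never appears 
as a translation unit in the database, which is why it gets skipped:

{noformat}
# Run IWYU over every translation unit in the compilation database; headers
# without a paired .cc file are never analyzed by this invocation.
python iwyu_tool.py -p build/debug/compile_commands.json
{noformat}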



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3111) Make IWYU process freestanding headers

2020-04-24 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3111:

Summary: Make IWYU process freestanding headers  (was: IWYU processes 
freestanding headers )

> Make IWYU process freestanding headers
> 
>
> Key: KUDU-3111
> URL: https://issues.apache.org/jira/browse/KUDU-3111
> Project: Kudu
>  Issue Type: Improvement
>Affects Versions: 1.7.0, 1.8.0, 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.12.0, 
> 1.11.1
>Reporter: Alexey Serbin
>Priority: Major
>
> When working off the compilation database, IWYU processes only associated 
> headers, i.e. {{.h}} files paired with corresponding {{.cc}} files.  It would 
> be nice to make IWYU process so-called freestanding header files as well.  [This 
> thread|https://github.com/include-what-you-use/include-what-you-use/issues/268]
>  contains very useful information on the topic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3007) ARM/aarch64 platform support

2020-04-24 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091778#comment-17091778
 ] 

Alexey Serbin commented on KUDU-3007:
-

Yes, I'm planning to take a closer look this weekend.  Thank you for the 
contribution!

> ARM/aarch64 platform support
> 
>
> Key: KUDU-3007
> URL: https://issues.apache.org/jira/browse/KUDU-3007
> Project: Kudu
>  Issue Type: Improvement
>Reporter: liusheng
>Priority: Critical
>
> As an important alternative to the x86 architecture, AArch64 (ARM) is 
> currently the dominant architecture in small devices such as phones, IoT 
> devices, security cameras, and drones. More and more hardware and cloud 
> vendors have started to provide ARM resources, such as AWS, Huawei, Packet, 
> and Ampere. ARM servers are usually cheaper than x86 servers, and more and 
> more ARM servers now offer performance comparable to x86 servers, or even 
> better efficiency in some areas.
> We want to propose adding an AArch64 CI for Kudu to promote the support for 
> Kudu on AArch64 platforms. We are willing to provide machines for the current 
> CI system and the manpower to manage the CI and fix problems that occur.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KUDU-2986) Incorrect value for the 'live_row_count' metric with pre-1.11.0 tables

2020-04-17 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-2986.
-
Fix Version/s: 1.12.0
   Resolution: Fixed

> Incorrect value for the 'live_row_count' metric with pre-1.11.0 tables
> --
>
> Key: KUDU-2986
> URL: https://issues.apache.org/jira/browse/KUDU-2986
> Project: Kudu
>  Issue Type: Bug
>  Components: CLI, client, master, metrics
>Affects Versions: 1.11.0
>Reporter: YifanZhang
>Assignee: LiFu He
>Priority: Major
> Fix For: 1.12.0
>
>
> When we upgraded the cluster with pre-1.11.0 tables, we got inconsistent 
> values for the 'live_row_count' metric of these tables:
> When visiting masterURL:port/metrics, we got 0 for old tables, and got a 
> positive integer for an old table with a newly added partition, which is the 
> count of rows in the newly added partition.
> When getting table statistics via the `kudu table statistics` CLI tool, we 
> got 0 for old tables and for the old table with a new partition.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65

2020-04-06 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3106:

Fix Version/s: 1.12.0
   Resolution: Fixed
   Status: Resolved  (was: In Review)

> getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65
> -
>
> Key: KUDU-3106
> URL: https://issues.apache.org/jira/browse/KUDU-3106
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java, security
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, 
> 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
> Fix For: 1.12.0
>
>
> With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in 
> https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159
>  isn't working as expected, throwing an exception:
> {noformat}
> java.lang.RuntimeException: cert uses unknown signature algorithm: 
> SHA256WITHRSA
> {noformat}
> It seems BouncyCastle 1.65 converts the name of the certificate signature 
> algorithm to uppercase.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65

2020-04-06 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3106:

Description: 
With [BouncyCastle|https://www.bouncycastle.org] 1.65 the code in 
https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159
 isn't working as expected, throwing an exception:

{noformat}
java.lang.RuntimeException: cert uses unknown signature algorithm: SHA256WITHRSA
{noformat}

It seems BouncyCastle 1.65 converts the name of the certificate signature 
algorithm to uppercase.

  was:
With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in 
https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159
 isn't working as expected, throwing an exception:

{noformat}
java.lang.RuntimeException: cert uses unknown signature algorithm: SHA256WITHRSA
{noformat}

It seems BouncyCastle 1.65 converts the name of the certificate signature 
algorithm to uppercase.


> getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65
> -
>
> Key: KUDU-3106
> URL: https://issues.apache.org/jira/browse/KUDU-3106
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java, security
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, 
> 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
> Fix For: 1.12.0
>
>
> With [BouncyCastle|https://www.bouncycastle.org] 1.65 the code in 
> https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159
>  isn't working as expected, throwing an exception:
> {noformat}
> java.lang.RuntimeException: cert uses unknown signature algorithm: 
> SHA256WITHRSA
> {noformat}
> It seems BouncyCastle 1.65 converts the name of the certificate signature 
> algorithm to uppercase.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65

2020-04-06 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3106:

Status: In Review  (was: In Progress)

> getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65
> -
>
> Key: KUDU-3106
> URL: https://issues.apache.org/jira/browse/KUDU-3106
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java, security
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, 
> 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
>
> With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in 
> https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159
>  isn't working as expected, throwing an exception:
> {noformat}
> java.lang.RuntimeException: cert uses unknown signature algorithm: 
> SHA256WITHRSA
> {noformat}
> It seems BouncyCastle 1.65 converts the name of the certificate signature 
> algorithm to uppercase.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65

2020-04-06 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3106:

Code Review: http://gerrit.cloudera.org:8080/15664

> getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65
> -
>
> Key: KUDU-3106
> URL: https://issues.apache.org/jira/browse/KUDU-3106
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java, security
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, 
> 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
>
> With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in 
> https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159
>  isn't working as expected, throwing an exception:
> {noformat}
> java.lang.RuntimeException: cert uses unknown signature algorithm: 
> SHA256WITHRSA
> {noformat}
> It seems BouncyCastle 1.65 converts the name of the certificate signature 
> algorithm to uppercase.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65

2020-04-06 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reassigned KUDU-3106:
---

Assignee: Alexey Serbin

> getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65
> -
>
> Key: KUDU-3106
> URL: https://issues.apache.org/jira/browse/KUDU-3106
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java, security
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, 
> 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
>
> With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in 
> https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159
>  isn't working as expected, throwing an exception:
> {noformat}
> java.lang.RuntimeException: cert uses unknown signature algorithm: 
> SHA256WITHRSA
> {noformat}
> It seems BouncyCastle 1.65 converts the name of the certificate signature 
> algorithm to uppercase.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65

2020-04-06 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3106:

Summary: getEndpointChannelBindings() isn't working as expected with 
BouncyCastle 1.65  (was: getEndpointChannelBindings() isn't working as expected 
with BouncyCastle 2.65)

> getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65
> -
>
> Key: KUDU-3106
> URL: https://issues.apache.org/jira/browse/KUDU-3106
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java, security
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, 
> 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1
>Reporter: Alexey Serbin
>Priority: Major
>
> With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in 
> https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159
>  isn't working as expected, throwing an exception:
> {noformat}
> java.lang.RuntimeException: cert uses unknown signature algorithm: 
> SHA256WITHRSA
> {noformat}
> It seems BouncyCastle 2.65 converts the name of the certificate signature 
> algorithm to uppercase.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65

2020-04-06 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3106:

Description: 
With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in 
https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159
 isn't working as expected, throwing an exception:

{noformat}
java.lang.RuntimeException: cert uses unknown signature algorithm: SHA256WITHRSA
{noformat}

It seems BouncyCastle 1.65 converts the name of the certificate signature 
algorithm to uppercase.

  was:
With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in 
https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159
 isn't working as expected, throwing an exception:

{noformat}
java.lang.RuntimeException: cert uses unknown signature algorithm: SHA256WITHRSA
{noformat}

It seems BouncyCastle 2.65 converts the name of the certificate signature 
algorithm to uppercase.


> getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65
> -
>
> Key: KUDU-3106
> URL: https://issues.apache.org/jira/browse/KUDU-3106
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java, security
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, 
> 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1
>Reporter: Alexey Serbin
>Priority: Major
>
> With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in 
> https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159
>  isn't working as expected, throwing an exception:
> {noformat}
> java.lang.RuntimeException: cert uses unknown signature algorithm: 
> SHA256WITHRSA
> {noformat}
> It seems BouncyCastle 1.65 converts the name of the certificate signature 
> algorithm to uppercase.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 2.65

2020-04-06 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3106:
---

 Summary: getEndpointChannelBindings() isn't working as expected 
with BouncyCastle 2.65
 Key: KUDU-3106
 URL: https://issues.apache.org/jira/browse/KUDU-3106
 Project: Kudu
  Issue Type: Bug
  Components: client, java, security
Affects Versions: 1.11.1, 1.11.0, 1.10.1, 1.10.0, 1.9.0, 1.7.1, 1.8.0, 
1.7.0, 1.6.0, 1.5.0, 1.4.0, 1.3.1, 1.3.0
Reporter: Alexey Serbin


With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in 
https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159
 isn't working as expected, throwing an exception:

{noformat}
java.lang.RuntimeException: cert uses unknown signature algorithm: SHA256WITHRSA
{noformat}

It seems BouncyCastle 2.65 converts the name of the certificate signature 
algorithm to uppercase.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-2573) Fully support Chrony in place of NTP

2020-04-06 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17076583#comment-17076583
 ] 

Alexey Serbin commented on KUDU-2573:
-

With this [changelist|https://gerrit.cloudera.org/#/c/15456/], the necessary 
piece of the documentation will be in the 1.12 release notes.

> Fully support Chrony in place of NTP
> 
>
> Key: KUDU-2573
> URL: https://issues.apache.org/jira/browse/KUDU-2573
> Project: Kudu
>  Issue Type: New Feature
>  Components: clock, master, tserver
>Reporter: Grant Henke
>Assignee: Alexey Serbin
>Priority: Major
>  Labels: clock
>
> This is to track fully supporting Chrony in place of NTP. Given Chrony is the 
> default in RHEL7+, running Kudu with Chrony is likely to be more common. 
> The work should entail:
>  * identifying and fixing or documenting any differences or gaps
>  * removing the experimental warnings from the documentation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KUDU-2573) Fully support Chrony in place of NTP

2020-04-06 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-2573.
-
Fix Version/s: 1.12.0
   Resolution: Fixed

> Fully support Chrony in place of NTP
> 
>
> Key: KUDU-2573
> URL: https://issues.apache.org/jira/browse/KUDU-2573
> Project: Kudu
>  Issue Type: New Feature
>  Components: clock, master, tserver
>Reporter: Grant Henke
>Assignee: Alexey Serbin
>Priority: Major
>  Labels: clock
> Fix For: 1.12.0
>
>
> This is to track fully supporting Chrony in place of NTP. Given Chrony is the 
> default in RHEL7+, running Kudu with Chrony is likely to be more common. 
> The work should entail:
>  * identifying and fixing or documenting any differences or gaps
>  * removing the experimental warnings from the documentation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-2798) Fix logging on deleted TSK entries

2020-04-06 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2798:

Affects Version/s: 1.10.1
   1.11.0
   1.11.1

> Fix logging on deleted TSK entries
> --
>
> Key: KUDU-2798
> URL: https://issues.apache.org/jira/browse/KUDU-2798
> Project: Kudu
>  Issue Type: Task
>Affects Versions: 1.8.0, 1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.11.0, 1.11.1
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Minor
>  Labels: newbie
>
> It seems the identifiers of the deleted TSK entries in the log lines below 
> need decoding:
> {noformat}
> I0312 15:17:14.808763 71553 catalog_manager.cc:4095] T 
>  P f05d759af7824df9aafedcc106674182: 
> Generated new TSK 2
> I0312 15:17:14.811144 71553 catalog_manager.cc:4133] T 
>  P f05d759af7824df9aafedcc106674182: Deleted 
> TSKs: �, �
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-2798) Fix logging on deleted TSK entries

2020-04-06 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2798:

Code Review: https://gerrit.cloudera.org/#/c/15657/

> Fix logging on deleted TSK entries
> --
>
> Key: KUDU-2798
> URL: https://issues.apache.org/jira/browse/KUDU-2798
> Project: Kudu
>  Issue Type: Task
>Affects Versions: 1.8.0, 1.9.0, 1.9.1, 1.10.0
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Minor
>  Labels: newbie
>
> It seems the identifiers of the deleted TSK entries in the log lines below 
> need decoding:
> {noformat}
> I0312 15:17:14.808763 71553 catalog_manager.cc:4095] T 
>  P f05d759af7824df9aafedcc106674182: 
> Generated new TSK 2
> I0312 15:17:14.811144 71553 catalog_manager.cc:4133] T 
>  P f05d759af7824df9aafedcc106674182: Deleted 
> TSKs: �, �
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KUDU-2798) Fix logging on deleted TSK entries

2020-04-06 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reassigned KUDU-2798:
---

Assignee: Alexey Serbin

> Fix logging on deleted TSK entries
> --
>
> Key: KUDU-2798
> URL: https://issues.apache.org/jira/browse/KUDU-2798
> Project: Kudu
>  Issue Type: Task
>Affects Versions: 1.8.0, 1.9.0, 1.9.1, 1.10.0
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Minor
>  Labels: newbie
>
> It seems the identifiers of the deleted TSK entries in the log lines below 
> need decoding:
> {noformat}
> I0312 15:17:14.808763 71553 catalog_manager.cc:4095] T 
>  P f05d759af7824df9aafedcc106674182: 
> Generated new TSK 2
> I0312 15:17:14.811144 71553 catalog_manager.cc:4133] T 
>  P f05d759af7824df9aafedcc106674182: Deleted 
> TSKs: �, �
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-2798) Fix logging on deleted TSK entries

2020-04-06 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2798:

Status: In Review  (was: In Progress)

> Fix logging on deleted TSK entries
> --
>
> Key: KUDU-2798
> URL: https://issues.apache.org/jira/browse/KUDU-2798
> Project: Kudu
>  Issue Type: Task
>Affects Versions: 1.8.0, 1.9.0, 1.9.1, 1.10.0
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Minor
>  Labels: newbie
>
> It seems the identifiers of the deleted TSK entries in the log lines below 
> need decoding:
> {noformat}
> I0312 15:17:14.808763 71553 catalog_manager.cc:4095] T 
>  P f05d759af7824df9aafedcc106674182: 
> Generated new TSK 2
> I0312 15:17:14.811144 71553 catalog_manager.cc:4133] T 
>  P f05d759af7824df9aafedcc106674182: Deleted 
> TSKs: �, �
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3105) kudu_client based application reports 'Locking callback not initialized' error

2020-04-03 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3105:
---

 Summary: kudu_client based application reports 'Locking callback 
not initialized' error
 Key: KUDU-3105
 URL: https://issues.apache.org/jira/browse/KUDU-3105
 Project: Kudu
  Issue Type: Bug
  Components: client, python, security
Affects Versions: 1.11.1, 1.11.0, 1.10.1, 1.10.0, 1.9.0
Reporter: Alexey Serbin


When using the kudu_client library compiled against OpenSSL 1.0.x with an 
OpenSSL 1.1.x runtime, Kudu client applications might report a 'Runtime error: 
Locking callback not initialized' error.

For example, {{kudu-python}}-based applications on RHEL/CentOS 7.7, if using 
{{kudu-client}} versions 1.9, 1.10, or 1.11 in a Python environment with 
OpenSSL 1.1.1d, might report an error like the one below:

{noformat}
Traceback (most recent call last):
  File "kudu-python-app.py", line 22, in 
client = kudu.connect(host=args.masters, port=args.ports)
  File "/opt/lib/python3.6/site-packages/kudu/__init__.py", line 96, in connect
rpc_timeout_ms=rpc_timeout_ms)
  File "kudu/client.pyx", line 297, in kudu.client.Client.__cinit__
  File "kudu/errors.pyx", line 62, in kudu.errors.check_status
kudu.errors.KuduBadStatus: b'Runtime error: Locking callback not initialized'
{noformat}

The issue is that {{libkudu_client}} compiled against OpenSSL 1.0.x uses an 
initialization code path specific to OpenSSL 1.0.x, and the post-condition 
requires that thread-safety locking callbacks be installed once the 
initialization is done.  However, those functions do not install the expected 
locking callbacks in OpenSSL 1.1.x, since OpenSSL takes a different approach to 
locking callbacks starting with version 1.1.0: the callbacks are no longer 
required because the multi-threading model was revamped in the newer versions 
of the library.
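
For context, a minimal sketch (not Kudu's actual initialization code) of the 
version-dependent OpenSSL setup involved: the 1.0.x path must install locking 
callbacks itself, and a post-condition may verify they are present; under 
1.1.x the callback APIs are no-op compatibility macros, so such a 
post-condition can never be satisfied.

{code:cpp}
#include <mutex>

#include <openssl/crypto.h>
#include <openssl/ssl.h>

#if OPENSSL_VERSION_NUMBER < 0x10100000L
static std::mutex* g_locks = nullptr;

// OpenSSL 1.0.x requires the application to provide locking callbacks
// for thread safety.
static void LockingCallback(int mode, int n, const char* /*file*/,
                            int /*line*/) {
  if (mode & CRYPTO_LOCK) {
    g_locks[n].lock();
  } else {
    g_locks[n].unlock();
  }
}
#endif

void InitOpenSslOnce() {
#if OPENSSL_VERSION_NUMBER < 0x10100000L
  // 1.0.x code path: initialize the library and install the callbacks.
  // A post-condition like the one described above would then check that
  // CRYPTO_get_locking_callback() != nullptr, which holds only under a
  // real 1.0.x runtime.
  SSL_library_init();
  SSL_load_error_strings();
  g_locks = new std::mutex[CRYPTO_num_locks()];
  CRYPTO_set_locking_callback(LockingCallback);
#else
  // 1.1.x code path: locking is handled internally by the library.
  OPENSSL_init_ssl(0, nullptr);
#endif
}
{code}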



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time

2020-04-01 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reassigned KUDU-3082:
---

Assignee: Alexey Serbin

> tablets in "CONSENSUS_MISMATCH" state for a long time
> -
>
> Key: KUDU-3082
> URL: https://issues.apache.org/jira/browse/KUDU-3082
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 1.10.1
>Reporter: YifanZhang
>Assignee: Alexey Serbin
>Priority: Major
> Attachments: master_leader.log, ts25.info.gz, ts26.log.gz
>
>
> Lately we found that a few tablets in one of our clusters are unhealthy; the 
> ksck output looks like:
>  
> {code:java}
> Tablet Summary
> Tablet 7404240f458f462d92b6588d07583a52 of table '' is conflicted: 3 
> replicas' active configs disagree with the leader master's
>   7380d797d2ea49e88d71091802fb1c81 (kudu-ts26): RUNNING
>   d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
>   A = 7380d797d2ea49e88d71091802fb1c81
>   B = d1952499f94a4e6087bee28466fcb09f
>   C = 47af52df1adc47e1903eb097e9c88f2e
>   D = 08beca5ed4d04003b6979bf8bac378d2
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B   C*   |  |  | Yes
>  A | A   B   C*   | 5| -1   | Yes
>  B | A   B   C| 5| -1   | Yes
>  C | A   B   C*  D~   | 5| 54649| No
> Tablet 6d9d3fb034314fa7bee9cfbf602bcdc8 of table '' is conflicted: 2 
> replicas' active configs disagree with the leader master's
>   d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
>   5a8aeadabdd140c29a09dabcae919b31 (kudu-ts21): RUNNING
> All reported replicas are:
>   A = d1952499f94a4e6087bee28466fcb09f
>   B = 47af52df1adc47e1903eb097e9c88f2e
>   C = 5a8aeadabdd140c29a09dabcae919b31
>   D = 14632cdbb0d04279bc772f64e06389f9
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B*  C|  |  | Yes
>  A | A   B*  C| 5| 5| Yes
>  B | A   B*  C   D~   | 5| 96176| No
>  C | A   B*  C| 5| 5| Yes
> Tablet bf1ec7d693b94632b099dc0928e76363 of table '' is conflicted: 1 
> replicas' active configs disagree with the leader master's
>   a9eaff3cf1ed483aae84954d649a (kudu-ts23): RUNNING
>   f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
>   A = a9eaff3cf1ed483aae84954d649a
>   B = f75df4a6b5ce404884313af5f906b392
>   C = 47af52df1adc47e1903eb097e9c88f2e
>   D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B   C*   |  |  | Yes
>  A | A   B   C*   | 1| -1   | Yes
>  B | A   B   C*   | 1| -1   | Yes
>  C | A   B   C*  D~   | 1| 2| No
> Tablet 3190a310857e4c64997adb477131488a of table '' is conflicted: 3 
> replicas' active configs disagree with the leader master's
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
>   f0f7b2f4b9d344e6929105f48365f38e (kudu-ts24): RUNNING
>   f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
> All reported replicas are:
>   A = 47af52df1adc47e1903eb097e9c88f2e
>   B = f0f7b2f4b9d344e6929105f48365f38e
>   C = f75df4a6b5ce404884313af5f906b392
>   D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A*  B   C|  |  | Yes
>  A | A*  B   C   D~   | 1| 1991 | No
>  B | A*  B   C| 1| 4| Yes
>  C | A*  B   C| 1| 4| Yes{code}
> These tablets couldn't recover for a couple of days until we restarted 
> kudu-ts27.
> I found many duplicated log entries on kudu-ts27 like:
> {code:java}
> I0314 04:38:41.511279 65731 raft_con

[jira] [Assigned] (KUDU-3098) leadership change during tablet_copy process may lead to an isolate replica

2020-04-01 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reassigned KUDU-3098:
---

Assignee: Alexey Serbin

> leadership change during tablet_copy process may lead to an isolate replica
> ---
>
> Key: KUDU-3098
> URL: https://issues.apache.org/jira/browse/KUDU-3098
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus, master
>Affects Versions: 1.10.1
>Reporter: YifanZhang
>Assignee: Alexey Serbin
>Priority: Major
>
> Lately we found some tablets in a cluster with a very large 
> "time_since_last_leader_heartbeat" metric; they are LEARNER/NON_VOTER and 
> seemingly couldn't become VOTER for a long time.
> These replicas were created during the rebalance/tablet_copy process. After 
> a new copy session began from the leader to the newly added NON_VOTER peer, 
> the leadership changed and the old leader aborted the uncommitted 
> CHANGE_CONFIG_OP operation. Finally the tablet_copy session ended, but the 
> new leader knew nothing about the new peer. 
> The master didn't delete this newly added replica because it has a larger 
> opid_index than the latest reported committed config. See the comments in 
> CatalogManager::ProcessTabletReport:
> {code:java}
> // 5. Tombstone a replica that is no longer part of the Raft config (and
> // not already tombstoned or deleted outright).
> //
> // If the report includes a committed raft config, we only tombstone if
> // the opid_index is strictly less than the latest reported committed
> // config. This prevents us from spuriously deleting replicas that have
> // just been added to the committed config and are in the process of copying.
> {code}
> Maybe we shouldn't use opid_index to determine if replicas are in the process 
> of copying.
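> 
> A hedged sketch of the guard the quoted comment describes (illustration 
> only, not the actual CatalogManager code): a replica reporting an opid_index 
> that is not strictly less than the latest committed config's is assumed to 
> be mid-copy and is spared, which is exactly why the orphaned replica above 
> is never tombstoned.
> {code:cpp}
> #include <cstdint>
> 
> // Illustrative predicate only; not the actual Kudu logic.
> bool ShouldTombstoneReplica(int64_t reported_config_opid_index,
>                             int64_t latest_committed_opid_index) {
>   return reported_config_opid_index < latest_committed_opid_index;
> }
> {code}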



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3100) AutoRebalancerTest.TestHandlingFailedTservers sometimes fails

2020-03-31 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3100:
---

 Summary: AutoRebalancerTest.TestHandlingFailedTservers sometimes 
fails
 Key: KUDU-3100
 URL: https://issues.apache.org/jira/browse/KUDU-3100
 Project: Kudu
  Issue Type: Bug
  Components: test
Affects Versions: 1.12.0
Reporter: Alexey Serbin
 Attachments: auto_rebalancer-test.txt.xz

The {{AutoRebalancerTest.TestHandlingFailedTservers}} sometimes fails with the 
following error messages:

{noformat}
W0327 22:40:10.759768  6796 auto_rebalancer.cc:666] Could not move replica: 
Network error: Client connection negotiation failed: client connection to 
127.2.102.194:33557: connect: Connection refused (error 111)
/data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/master/auto_rebalancer-test.cc:524:
 Failure
Value of: matched   
  Actual: false 
Expected: true  
not one string matched pattern scheduled replica move failed to complete: 
Network error
{noformat}

The log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3100) AutoRebalancerTest.TestHandlingFailedTservers sometimes fails

2020-03-31 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3100:

Attachment: auto_rebalancer-test.txt.xz

> AutoRebalancerTest.TestHandlingFailedTservers sometimes fails
> -
>
> Key: KUDU-3100
> URL: https://issues.apache.org/jira/browse/KUDU-3100
> Project: Kudu
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.12.0
>Reporter: Alexey Serbin
>Priority: Major
> Attachments: auto_rebalancer-test.txt.xz
>
>
> The {{AutoRebalancerTest.TestHandlingFailedTservers}} sometimes fails with 
> the following error messages:
> {noformat}
> W0327 22:40:10.759768  6796 auto_rebalancer.cc:666] Could not move replica: 
> Network error: Client connection negotiation failed: client connection to 
> 127.2.102.194:33557: connect: Connection refused (error 111)
> /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/master/auto_rebalancer-test.cc:524:
>  Failure
> Value of: matched
>   Actual: false
> Expected: true
> not one string matched pattern scheduled replica move failed to complete: 
> Network error
> {noformat}
> The log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3095) RaftConsensusNonVoterITest.PromotedReplicaCanVote sometimes fails

2020-03-29 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3095:
---

 Summary: RaftConsensusNonVoterITest.PromotedReplicaCanVote  
sometimes fails
 Key: KUDU-3095
 URL: https://issues.apache.org/jira/browse/KUDU-3095
 Project: Kudu
  Issue Type: Bug
  Components: test
Affects Versions: 1.12.0
Reporter: Alexey Serbin
 Attachments: raft_consensus_nonvoter-itest.txt.xz

The {{RaftConsensusNonVoterITest.PromotedReplicaCanVote}} scenario sometimes 
fails with an error:

{noformat}
I0327 00:44:00.297801  4401 raft_consensus.cc:2810] T 
c2378cfec6604e0e813f43775107f2e6 P 4f6b943b18a649fabbd6cfb8d06ed20f [term 3 
FOLLOWER]: CHANGE_CONFIG_OP replication failed: Aborted: Transaction aborted by 
new leader
/data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/integration-tests/raft_consensus_nonvoter-itest.cc:1079:
 Failure
Failed 
{noformat}

The log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3094) KuduTest.TestWebUIDoesNotCrashCluster sometimes fail

2020-03-29 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3094:
---

 Summary: KuduTest.TestWebUIDoesNotCrashCluster sometimes fail
 Key: KUDU-3094
 URL: https://issues.apache.org/jira/browse/KUDU-3094
 Project: Kudu
  Issue Type: Bug
  Components: test
Affects Versions: 1.12.0
Reporter: Alexey Serbin
 Attachments: webserver-stress-itest.txt.xz

The {{KuduTest.TestWebUIDoesNotCrashCluster}} test scenario sometimes fails, 
timing out on creating the test table:

{noformat}
F0327 00:47:41.845475   361 test_workload.cc:329] Timed out: Timed out waiting 
for Table Creation
{noformat}

The log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time

2020-03-26 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068181#comment-17068181
 ] 

Alexey Serbin commented on KUDU-3082:
-

[~zhangyifan27], do you have an idea of what might lead to such a situation?  
Did anything specific happen to the cluster?  I'm trying to come up with a 
reproduction scenario for this.  Any hint might be useful.  Thanks!

> tablets in "CONSENSUS_MISMATCH" state for a long time
> -
>
> Key: KUDU-3082
> URL: https://issues.apache.org/jira/browse/KUDU-3082
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 1.10.1
>Reporter: YifanZhang
>Priority: Major
>
> Lately we found that a few tablets in one of our clusters are unhealthy; the 
> ksck output looks like:
>  
> {code:java}
> Tablet Summary
> Tablet 7404240f458f462d92b6588d07583a52 of table '' is conflicted: 3 
> replicas' active configs disagree with the leader master's
>   7380d797d2ea49e88d71091802fb1c81 (kudu-ts26): RUNNING
>   d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
>   A = 7380d797d2ea49e88d71091802fb1c81
>   B = d1952499f94a4e6087bee28466fcb09f
>   C = 47af52df1adc47e1903eb097e9c88f2e
>   D = 08beca5ed4d04003b6979bf8bac378d2
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B   C*   |  |  | Yes
>  A | A   B   C*   | 5| -1   | Yes
>  B | A   B   C| 5| -1   | Yes
>  C | A   B   C*  D~   | 5| 54649| No
> Tablet 6d9d3fb034314fa7bee9cfbf602bcdc8 of table '' is conflicted: 2 
> replicas' active configs disagree with the leader master's
>   d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
>   5a8aeadabdd140c29a09dabcae919b31 (kudu-ts21): RUNNING
> All reported replicas are:
>   A = d1952499f94a4e6087bee28466fcb09f
>   B = 47af52df1adc47e1903eb097e9c88f2e
>   C = 5a8aeadabdd140c29a09dabcae919b31
>   D = 14632cdbb0d04279bc772f64e06389f9
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B*  C|  |  | Yes
>  A | A   B*  C| 5| 5| Yes
>  B | A   B*  C   D~   | 5| 96176| No
>  C | A   B*  C| 5| 5| Yes
> Tablet bf1ec7d693b94632b099dc0928e76363 of table '' is conflicted: 1 
> replicas' active configs disagree with the leader master's
>   a9eaff3cf1ed483aae84954d649a (kudu-ts23): RUNNING
>   f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
>   A = a9eaff3cf1ed483aae84954d649a
>   B = f75df4a6b5ce404884313af5f906b392
>   C = 47af52df1adc47e1903eb097e9c88f2e
>   D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B   C*   |  |  | Yes
>  A | A   B   C*   | 1| -1   | Yes
>  B | A   B   C*   | 1| -1   | Yes
>  C | A   B   C*  D~   | 1| 2| No
> Tablet 3190a310857e4c64997adb477131488a of table '' is conflicted: 3 
> replicas' active configs disagree with the leader master's
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
>   f0f7b2f4b9d344e6929105f48365f38e (kudu-ts24): RUNNING
>   f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
> All reported replicas are:
>   A = 47af52df1adc47e1903eb097e9c88f2e
>   B = f0f7b2f4b9d344e6929105f48365f38e
>   C = f75df4a6b5ce404884313af5f906b392
>   D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A*  B   C|  |  | Yes
>  A | A*  B   C   D~   | 1| 1991 | No
>  B | A*  B   C| 1| 4| Yes
>  C | A*  B   C| 1| 4| Yes{code}
> These tablets couldn't recover for a couple of days until we resta

[jira] [Updated] (KUDU-3087) Python tests failed on arm64

2020-03-24 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3087:

Status: In Review  (was: In Progress)

> Python tests failed on arm64
> 
>
> Key: KUDU-3087
> URL: https://issues.apache.org/jira/browse/KUDU-3087
> Project: Kudu
>  Issue Type: Sub-task
>Reporter: huangtianhua
>Assignee: Alexey Serbin
>Priority: Major
> Attachments: python_test.rar
>
>
> I ran the Python tests for Kudu on the arm64 platform based on 
> https://gerrit.cloudera.org/#/c/14964/ ; the tests failed with the error 
> info below:
> W0323 02:54:39.938022  9110 negotiation.cc:313] Failed RPC negotiation. Trace:
> 0323 02:54:39.936597 (+ 0us) reactor.cc:604] Submitting negotiation task 
> for client connection to 127.8.25.194:34669
> 0323 02:54:39.936737 (+   140us) negotiation.cc:98] Waiting for socket to 
> connect
> 0323 02:54:39.936746 (+ 9us) client_negotiation.cc:169] Beginning 
> negotiation
> 0323 02:54:39.936810 (+64us) client_negotiation.cc:246] Sending NEGOTIATE 
> NegotiatePB request
> 0323 02:54:39.937073 (+   263us) client_negotiation.cc:263] Received 
> NEGOTIATE NegotiatePB response
> 0323 02:54:39.937074 (+ 1us) client_negotiation.cc:357] Received 
> NEGOTIATE response from server
> 0323 02:54:39.937079 (+ 5us) client_negotiation.cc:184] Negotiated 
> authn=TOKEN
> 0323 02:54:39.937168 (+89us) client_negotiation.cc:473] Sending 
> TLS_HANDSHAKE message to server
> 0323 02:54:39.937171 (+ 3us) client_negotiation.cc:246] Sending 
> TLS_HANDSHAKE NegotiatePB request
> 0323 02:54:39.937724 (+   553us) client_negotiation.cc:263] Received 
> TLS_HANDSHAKE NegotiatePB response
> 0323 02:54:39.937726 (+ 2us) client_negotiation.cc:486] Received 
> TLS_HANDSHAKE response from server
> 0323 02:54:39.937906 (+   180us) negotiation.cc:304] Negotiation complete: 
> Runtime error: Client connection negotiation failed: client connection to 
> 127.8.25.194:34669: TLS Handshake error: error:1416F086:SSL 
> routines:tls_process_server_certificate:certificate verify 
> failed:../ssl/statem/statem_clnt.c:1924
> Metrics: 
> {"client-negotiator.queue_time_us":90,"thread_start_us":41,"threads_started":1}
> The Python tests were successful before the commit 
> https://github.com/apache/kudu/commit/3343144fefaad5a30e95e21297c64c78e308fa1f
>  and when I tried removing this commit from master, the Python tests 
> succeeded, so it seems the problem was introduced by 
> https://github.com/apache/kudu/commit/3343144fefaad5a30e95e21297c64c78e308fa1f,
>  but I am sorry I can't fix this myself; could someone help me? Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3087) Python tests failed on arm64

2020-03-24 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3087:

Code Review: http://gerrit.cloudera.org:8080/15554

> Python tests failed on arm64
> 
>
> Key: KUDU-3087
> URL: https://issues.apache.org/jira/browse/KUDU-3087
> Project: Kudu
>  Issue Type: Sub-task
>Reporter: huangtianhua
>Assignee: Alexey Serbin
>Priority: Major
> Attachments: python_test.rar
>
>
> I ran the Python tests for Kudu on the arm64 platform based on 
> https://gerrit.cloudera.org/#/c/14964/ ; the tests failed with the error 
> info below:
> W0323 02:54:39.938022  9110 negotiation.cc:313] Failed RPC negotiation. Trace:
> 0323 02:54:39.936597 (+ 0us) reactor.cc:604] Submitting negotiation task 
> for client connection to 127.8.25.194:34669
> 0323 02:54:39.936737 (+   140us) negotiation.cc:98] Waiting for socket to 
> connect
> 0323 02:54:39.936746 (+ 9us) client_negotiation.cc:169] Beginning 
> negotiation
> 0323 02:54:39.936810 (+64us) client_negotiation.cc:246] Sending NEGOTIATE 
> NegotiatePB request
> 0323 02:54:39.937073 (+   263us) client_negotiation.cc:263] Received 
> NEGOTIATE NegotiatePB response
> 0323 02:54:39.937074 (+ 1us) client_negotiation.cc:357] Received 
> NEGOTIATE response from server
> 0323 02:54:39.937079 (+ 5us) client_negotiation.cc:184] Negotiated 
> authn=TOKEN
> 0323 02:54:39.937168 (+89us) client_negotiation.cc:473] Sending 
> TLS_HANDSHAKE message to server
> 0323 02:54:39.937171 (+ 3us) client_negotiation.cc:246] Sending 
> TLS_HANDSHAKE NegotiatePB request
> 0323 02:54:39.937724 (+   553us) client_negotiation.cc:263] Received 
> TLS_HANDSHAKE NegotiatePB response
> 0323 02:54:39.937726 (+ 2us) client_negotiation.cc:486] Received 
> TLS_HANDSHAKE response from server
> 0323 02:54:39.937906 (+   180us) negotiation.cc:304] Negotiation complete: 
> Runtime error: Client connection negotiation failed: client connection to 
> 127.8.25.194:34669: TLS Handshake error: error:1416F086:SSL 
> routines:tls_process_server_certificate:certificate verify 
> failed:../ssl/statem/statem_clnt.c:1924
> Metrics: 
> {"client-negotiator.queue_time_us":90,"thread_start_us":41,"threads_started":1}
> The Python tests were successful before the commit 
> https://github.com/apache/kudu/commit/3343144fefaad5a30e95e21297c64c78e308fa1f
>  and when I tried removing this commit from master, the Python tests 
> succeeded, so it seems the problem was introduced by 
> https://github.com/apache/kudu/commit/3343144fefaad5a30e95e21297c64c78e308fa1f,
>  but I am sorry I can't fix this myself; could someone help me? Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3087) Python tests failed on arm64

2020-03-24 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066360#comment-17066360
 ] 

Alexey Serbin commented on KUDU-3087:
-

Thank you for pinging me w.r.t. the progress on this issue, [~huangtianhua]! 
 I posted a patch for review: http://gerrit.cloudera.org:8080/15554

> Python tests failed on arm64
> 
>
> Key: KUDU-3087
> URL: https://issues.apache.org/jira/browse/KUDU-3087
> Project: Kudu
>  Issue Type: Sub-task
>Reporter: huangtianhua
>Assignee: Alexey Serbin
>Priority: Major
> Attachments: python_test.rar
>
>
> I ran the Python tests for Kudu on the arm64 platform based on 
> https://gerrit.cloudera.org/#/c/14964/ ; the tests failed with the error 
> info below:
> W0323 02:54:39.938022  9110 negotiation.cc:313] Failed RPC negotiation. Trace:
> 0323 02:54:39.936597 (+ 0us) reactor.cc:604] Submitting negotiation task 
> for client connection to 127.8.25.194:34669
> 0323 02:54:39.936737 (+   140us) negotiation.cc:98] Waiting for socket to 
> connect
> 0323 02:54:39.936746 (+ 9us) client_negotiation.cc:169] Beginning 
> negotiation
> 0323 02:54:39.936810 (+64us) client_negotiation.cc:246] Sending NEGOTIATE 
> NegotiatePB request
> 0323 02:54:39.937073 (+   263us) client_negotiation.cc:263] Received 
> NEGOTIATE NegotiatePB response
> 0323 02:54:39.937074 (+ 1us) client_negotiation.cc:357] Received 
> NEGOTIATE response from server
> 0323 02:54:39.937079 (+ 5us) client_negotiation.cc:184] Negotiated 
> authn=TOKEN
> 0323 02:54:39.937168 (+89us) client_negotiation.cc:473] Sending 
> TLS_HANDSHAKE message to server
> 0323 02:54:39.937171 (+ 3us) client_negotiation.cc:246] Sending 
> TLS_HANDSHAKE NegotiatePB request
> 0323 02:54:39.937724 (+   553us) client_negotiation.cc:263] Received 
> TLS_HANDSHAKE NegotiatePB response
> 0323 02:54:39.937726 (+ 2us) client_negotiation.cc:486] Received 
> TLS_HANDSHAKE response from server
> 0323 02:54:39.937906 (+   180us) negotiation.cc:304] Negotiation complete: 
> Runtime error: Client connection negotiation failed: client connection to 
> 127.8.25.194:34669: TLS Handshake error: error:1416F086:SSL 
> routines:tls_process_server_certificate:certificate verify 
> failed:../ssl/statem/statem_clnt.c:1924
> Metrics: 
> {"client-negotiator.queue_time_us":90,"thread_start_us":41,"threads_started":1}
> The Python tests were successful before the commit 
> https://github.com/apache/kudu/commit/3343144fefaad5a30e95e21297c64c78e308fa1f
>  and when I tried removing this commit from master, the Python tests 
> succeeded, so it seems the problem was introduced by 
> https://github.com/apache/kudu/commit/3343144fefaad5a30e95e21297c64c78e308fa1f,
>  but I am sorry I can't fix this myself; could someone help me? Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3087) Python tests failed on arm64

2020-03-23 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065046#comment-17065046
 ] 

Alexey Serbin commented on KUDU-3087:
-

[~huangtianhua] what Linux distro are you using to run those python tests?

> Python tests failed on arm64
> 
>
> Key: KUDU-3087
> URL: https://issues.apache.org/jira/browse/KUDU-3087
> Project: Kudu
>  Issue Type: Sub-task
>Reporter: huangtianhua
>Assignee: Alexey Serbin
>Priority: Major
> Attachments: python_test.rar
>
>
> I ran the Python tests for Kudu on the arm64 platform based on 
> https://gerrit.cloudera.org/#/c/14964/ ; the tests failed with the error 
> info below:
> W0323 02:54:39.938022  9110 negotiation.cc:313] Failed RPC negotiation. Trace:
> 0323 02:54:39.936597 (+ 0us) reactor.cc:604] Submitting negotiation task 
> for client connection to 127.8.25.194:34669
> 0323 02:54:39.936737 (+   140us) negotiation.cc:98] Waiting for socket to 
> connect
> 0323 02:54:39.936746 (+ 9us) client_negotiation.cc:169] Beginning 
> negotiation
> 0323 02:54:39.936810 (+64us) client_negotiation.cc:246] Sending NEGOTIATE 
> NegotiatePB request
> 0323 02:54:39.937073 (+   263us) client_negotiation.cc:263] Received 
> NEGOTIATE NegotiatePB response
> 0323 02:54:39.937074 (+ 1us) client_negotiation.cc:357] Received 
> NEGOTIATE response from server
> 0323 02:54:39.937079 (+ 5us) client_negotiation.cc:184] Negotiated 
> authn=TOKEN
> 0323 02:54:39.937168 (+89us) client_negotiation.cc:473] Sending 
> TLS_HANDSHAKE message to server
> 0323 02:54:39.937171 (+ 3us) client_negotiation.cc:246] Sending 
> TLS_HANDSHAKE NegotiatePB request
> 0323 02:54:39.937724 (+   553us) client_negotiation.cc:263] Received 
> TLS_HANDSHAKE NegotiatePB response
> 0323 02:54:39.937726 (+ 2us) client_negotiation.cc:486] Received 
> TLS_HANDSHAKE response from server
> 0323 02:54:39.937906 (+   180us) negotiation.cc:304] Negotiation complete: 
> Runtime error: Client connection negotiation failed: client connection to 
> 127.8.25.194:34669: TLS Handshake error: error:1416F086:SSL 
> routines:tls_process_server_certificate:certificate verify 
> failed:../ssl/statem/statem_clnt.c:1924
> Metrics: 
> {"client-negotiator.queue_time_us":90,"thread_start_us":41,"threads_started":1}
> The Python tests were successful before the commit 
> https://github.com/apache/kudu/commit/3343144fefaad5a30e95e21297c64c78e308fa1f
>  and when I tried removing this commit from master, the Python tests 
> succeeded, so it seems the problem was introduced by 
> https://github.com/apache/kudu/commit/3343144fefaad5a30e95e21297c64c78e308fa1f,
>  but I am sorry I can't fix this myself; could someone help me? Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KUDU-3087) Python tests failed on arm64

2020-03-23 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065025#comment-17065025
 ] 

Alexey Serbin edited comment on KUDU-3087 at 3/23/20, 7:48 PM:
---

Sure, I'll take a look.

I suspect 768-bit crypto is considered too weak for OpenSSL security level 1 
(we set the security level to 0 in our other tests), where security level 1 is 
the default for OpenSSL 1.1.x on older Linux distros (on CentOS 8.1 the default 
OpenSSL security level is 2).


was (Author: aserbin):
Sure, I'll take a look.

I suspect 768-bit crypto is considered too weak for OpenSSL security level 1 
(we set the security level to 0 in our other tests), where security level 1 is 
the default for OpenSSL prior to 1.1.0 version.
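
A minimal sketch of the knob in question (assuming OpenSSL 1.1.x APIs; not 
Kudu's actual test setup): dropping the security level to 0 re-admits weak 
keys, such as 768-bit RSA, that level 1 and above reject during certificate 
verification.

{code:cpp}
#include <openssl/ssl.h>

// Sketch: create an SSL context that accepts weak (e.g. 768-bit) keys by
// dropping the OpenSSL security level to 0 (API available since 1.1.0).
SSL_CTX* NewPermissiveSslCtx() {
  SSL_CTX* ctx = SSL_CTX_new(TLS_method());
  if (ctx != nullptr) {
    SSL_CTX_set_security_level(ctx, 0);
  }
  return ctx;
}
{code}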

> Python tests failed on arm64
> 
>
> Key: KUDU-3087
> URL: https://issues.apache.org/jira/browse/KUDU-3087
> Project: Kudu
>  Issue Type: Sub-task
>Reporter: huangtianhua
>Assignee: Alexey Serbin
>Priority: Major
> Attachments: python_test.rar
>
>
> I ran the Python tests for Kudu on the arm64 platform based on 
> https://gerrit.cloudera.org/#/c/14964/ ; the tests failed with the error 
> info below:
> W0323 02:54:39.938022  9110 negotiation.cc:313] Failed RPC negotiation. Trace:
> 0323 02:54:39.936597 (+ 0us) reactor.cc:604] Submitting negotiation task 
> for client connection to 127.8.25.194:34669
> 0323 02:54:39.936737 (+   140us) negotiation.cc:98] Waiting for socket to 
> connect
> 0323 02:54:39.936746 (+ 9us) client_negotiation.cc:169] Beginning 
> negotiation
> 0323 02:54:39.936810 (+64us) client_negotiation.cc:246] Sending NEGOTIATE 
> NegotiatePB request
> 0323 02:54:39.937073 (+   263us) client_negotiation.cc:263] Received 
> NEGOTIATE NegotiatePB response
> 0323 02:54:39.937074 (+ 1us) client_negotiation.cc:357] Received 
> NEGOTIATE response from server
> 0323 02:54:39.937079 (+ 5us) client_negotiation.cc:184] Negotiated 
> authn=TOKEN
> 0323 02:54:39.937168 (+89us) client_negotiation.cc:473] Sending 
> TLS_HANDSHAKE message to server
> 0323 02:54:39.937171 (+ 3us) client_negotiation.cc:246] Sending 
> TLS_HANDSHAKE NegotiatePB request
> 0323 02:54:39.937724 (+   553us) client_negotiation.cc:263] Received 
> TLS_HANDSHAKE NegotiatePB response
> 0323 02:54:39.937726 (+ 2us) client_negotiation.cc:486] Received 
> TLS_HANDSHAKE response from server
> 0323 02:54:39.937906 (+   180us) negotiation.cc:304] Negotiation complete: 
> Runtime error: Client connection negotiation failed: client connection to 
> 127.8.25.194:34669: TLS Handshake error: error:1416F086:SSL 
> routines:tls_process_server_certificate:certificate verify 
> failed:../ssl/statem/statem_clnt.c:1924
> Metrics: 
> {"client-negotiator.queue_time_us":90,"thread_start_us":41,"threads_started":1}
> The Python tests were successful before the commit 
> https://github.com/apache/kudu/commit/3343144fefaad5a30e95e21297c64c78e308fa1f
>  and when I tried removing this commit from master, the Python tests 
> succeeded, so it seems the problem was introduced by 
> https://github.com/apache/kudu/commit/3343144fefaad5a30e95e21297c64c78e308fa1f,
>  but I am sorry I can't fix this myself; could someone help me? Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3087) Python tests failed on arm64

2020-03-23 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065025#comment-17065025
 ] 

Alexey Serbin commented on KUDU-3087:
-

Sure, I'll take a look.

I suspect 768-bit crypto is considered too weak for OpenSSL security level 1 
(we set the security level to 0 in our other tests), where security level 1 is 
the default for OpenSSL prior to 1.1.0 version.

> Python tests failed on arm64
> 
>
> Key: KUDU-3087
> URL: https://issues.apache.org/jira/browse/KUDU-3087
> Project: Kudu
>  Issue Type: Sub-task
>Reporter: huangtianhua
>Assignee: Alexey Serbin
>Priority: Major
> Attachments: python_test.rar
>
>
> I ran the Python tests for Kudu on the arm64 platform based on 
> https://gerrit.cloudera.org/#/c/14964/ ; the tests failed with the error 
> info below:
> W0323 02:54:39.938022  9110 negotiation.cc:313] Failed RPC negotiation. Trace:
> 0323 02:54:39.936597 (+ 0us) reactor.cc:604] Submitting negotiation task 
> for client connection to 127.8.25.194:34669
> 0323 02:54:39.936737 (+   140us) negotiation.cc:98] Waiting for socket to 
> connect
> 0323 02:54:39.936746 (+ 9us) client_negotiation.cc:169] Beginning 
> negotiation
> 0323 02:54:39.936810 (+64us) client_negotiation.cc:246] Sending NEGOTIATE 
> NegotiatePB request
> 0323 02:54:39.937073 (+   263us) client_negotiation.cc:263] Received 
> NEGOTIATE NegotiatePB response
> 0323 02:54:39.937074 (+ 1us) client_negotiation.cc:357] Received 
> NEGOTIATE response from server
> 0323 02:54:39.937079 (+ 5us) client_negotiation.cc:184] Negotiated 
> authn=TOKEN
> 0323 02:54:39.937168 (+89us) client_negotiation.cc:473] Sending 
> TLS_HANDSHAKE message to server
> 0323 02:54:39.937171 (+ 3us) client_negotiation.cc:246] Sending 
> TLS_HANDSHAKE NegotiatePB request
> 0323 02:54:39.937724 (+   553us) client_negotiation.cc:263] Received 
> TLS_HANDSHAKE NegotiatePB response
> 0323 02:54:39.937726 (+ 2us) client_negotiation.cc:486] Received 
> TLS_HANDSHAKE response from server
> 0323 02:54:39.937906 (+   180us) negotiation.cc:304] Negotiation complete: 
> Runtime error: Client connection negotiation failed: client connection to 
> 127.8.25.194:34669: TLS Handshake error: error:1416F086:SSL 
> routines:tls_process_server_certificate:certificate verify 
> failed:../ssl/statem/statem_clnt.c:1924
> Metrics: 
> {"client-negotiator.queue_time_us":90,"thread_start_us":41,"threads_started":1}
> The Python tests were successful before the commit 
> https://github.com/apache/kudu/commit/3343144fefaad5a30e95e21297c64c78e308fa1f
>  and when I tried removing this commit from master, the Python tests 
> succeeded, so it seems the problem was introduced by 
> https://github.com/apache/kudu/commit/3343144fefaad5a30e95e21297c64c78e308fa1f,
>  but I am sorry I can't fix this myself; could someone help me? Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3084) Multiple time sources with fallback behavior between them

2020-03-19 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3084:
---

 Summary: Multiple time sources with fallback behavior between them
 Key: KUDU-3084
 URL: https://issues.apache.org/jira/browse/KUDU-3084
 Project: Kudu
  Issue Type: Improvement
  Components: master, tserver
Reporter: Alexey Serbin


[~tlipcon] suggested an alternative approach to configure and select 
HybridClock's time source.

Kudu servers could maintain multiple time sources and switch between them with 
a fallback behavior.  The default or preferred time source might be any of the 
existing ones (e.g., the built-in client), but when it's not available, another 
available time source is selected (e.g., {{system}} -- the NTP-synchronized 
local clock).  Switching between time sources can be done:
* only upon startup/initialization
* upon startup/initialization and later during normal run time

The advantages are:
* easier deployment and configuration of Kudu clusters
* simplified upgrade path from older releases using {{system}} time source to 
newer releases using {{builtin}} time source by default

There are downsides, though.  Since the new way of maintaining the time source 
is more dynamic, it can:
* mask various configuration or network issues
* result in different time sources being used within the same Kudu cluster due 
to transient issues
* introduce extra startup delay
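
A hypothetical sketch of the selection logic (illustrative names only, not 
Kudu's actual classes or flags):

{code:cpp}
#include <functional>
#include <string>
#include <vector>

// Hypothetical candidate: the probe returns true if the time source is
// reachable and synchronized.
struct TimeSourceCandidate {
  std::string name;                  // e.g. "builtin", "system"
  std::function<bool()> available;
};

// Walk the preference-ordered list and pick the first usable source.
// With run-time fallback this would be re-evaluated periodically rather
// than only at startup/initialization.
std::string SelectTimeSource(
    const std::vector<TimeSourceCandidate>& candidates) {
  for (const auto& c : candidates) {
    if (c.available()) {
      return c.name;
    }
  }
  return "none";  // the caller decides whether this fails startup
}
{code}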



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KUDU-2928) built-in NTP client: tests to evaluate the behavior of the client

2020-03-19 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-2928.
-
Fix Version/s: 1.12.0
   Resolution: Fixed

Implemented with {{4aa0c7c0bc7d91af8be9a837b64f2a53fe31dd44}}

> built-in NTP client: tests to evaluate the behavior of the client
> -
>
> Key: KUDU-2928
> URL: https://issues.apache.org/jira/browse/KUDU-2928
> Project: Kudu
>  Issue Type: Sub-task
>  Components: clock, test
>Affects Versions: 1.11.0
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
>  Labels: clock
> Fix For: 1.12.0
>
>
> It's necessary to implement tests covering the behavior of the built-in NTP 
> client in various corner cases:
> * A set of NTP servers which doesn't agree on time
> * non-synchronized NTP server
> * NTP server that loses track of its reference and becomes a false ticker
> * etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time

2020-03-19 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062763#comment-17062763
 ] 

Alexey Serbin commented on KUDU-3082:
-

[~zhangyifan27], what Kudu version is that?

> tablets in "CONSENSUS_MISMATCH" state for a long time
> -
>
> Key: KUDU-3082
> URL: https://issues.apache.org/jira/browse/KUDU-3082
> Project: Kudu
>  Issue Type: Bug
>Reporter: YifanZhang
>Priority: Major
>
> Lately we found that a few tablets in one of our clusters are unhealthy; the 
> ksck output looks like:
>  
> {code:java}
> Tablet Summary
> Tablet 7404240f458f462d92b6588d07583a52 of table '' is conflicted: 3 
> replicas' active configs disagree with the leader master's
>   7380d797d2ea49e88d71091802fb1c81 (kudu-ts26): RUNNING
>   d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
>   A = 7380d797d2ea49e88d71091802fb1c81
>   B = d1952499f94a4e6087bee28466fcb09f
>   C = 47af52df1adc47e1903eb097e9c88f2e
>   D = 08beca5ed4d04003b6979bf8bac378d2
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B   C*   |  |  | Yes
>  A | A   B   C*   | 5| -1   | Yes
>  B | A   B   C| 5| -1   | Yes
>  C | A   B   C*  D~   | 5| 54649| No
> Tablet 6d9d3fb034314fa7bee9cfbf602bcdc8 of table '' is conflicted: 2 
> replicas' active configs disagree with the leader master's
>   d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
>   5a8aeadabdd140c29a09dabcae919b31 (kudu-ts21): RUNNING
> All reported replicas are:
>   A = d1952499f94a4e6087bee28466fcb09f
>   B = 47af52df1adc47e1903eb097e9c88f2e
>   C = 5a8aeadabdd140c29a09dabcae919b31
>   D = 14632cdbb0d04279bc772f64e06389f9
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B*  C|  |  | Yes
>  A | A   B*  C| 5| 5| Yes
>  B | A   B*  C   D~   | 5| 96176| No
>  C | A   B*  C| 5| 5| Yes
> Tablet bf1ec7d693b94632b099dc0928e76363 of table '' is conflicted: 1 
> replicas' active configs disagree with the leader master's
>   a9eaff3cf1ed483aae84954d649a (kudu-ts23): RUNNING
>   f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
>   A = a9eaff3cf1ed483aae84954d649a
>   B = f75df4a6b5ce404884313af5f906b392
>   C = 47af52df1adc47e1903eb097e9c88f2e
>   D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B   C*   |  |  | Yes
>  A | A   B   C*   | 1| -1   | Yes
>  B | A   B   C*   | 1| -1   | Yes
>  C | A   B   C*  D~   | 1| 2| No
> Tablet 3190a310857e4c64997adb477131488a of table '' is conflicted: 3 
> replicas' active configs disagree with the leader master's
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
>   f0f7b2f4b9d344e6929105f48365f38e (kudu-ts24): RUNNING
>   f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
> All reported replicas are:
>   A = 47af52df1adc47e1903eb097e9c88f2e
>   B = f0f7b2f4b9d344e6929105f48365f38e
>   C = f75df4a6b5ce404884313af5f906b392
>   D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A*  B   C|  |  | Yes
>  A | A*  B   C   D~   | 1| 1991 | No
>  B | A*  B   C| 1| 4| Yes
>  C | A*  B   C| 1| 4| Yes{code}
> These tablets couldn't recover for a couple of days until we restart 
> kudu-ts27.
> I found many duplicated log messages in kudu-ts27 like:
> {code:java}
> I0314 04:38:41.511279 65731 raft_consensus.cc:937] T 
> 7404240f458f462d92b6588d07583a52 P 47af52df1adc47e1903eb097e9c88f2e [term 3 
> LEAD

[jira] [Updated] (KUDU-3067) Inexplict cloud detection for AWS and OpenStack based cloud by querying metadata

2020-03-19 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3067:

Fix Version/s: 1.12.0
   Resolution: Fixed
   Status: Resolved  (was: In Review)

> Inexplict cloud detection for AWS and OpenStack based cloud by querying 
> metadata
> 
>
> Key: KUDU-3067
> URL: https://issues.apache.org/jira/browse/KUDU-3067
> Project: Kudu
>  Issue Type: Bug
>Reporter: liusheng
>Assignee: Alexey Serbin
>Priority: Major
> Fix For: 1.12.0
>
>
> The cloud detector checks which cloud provider an instance is running on 
> (see [here|#L59-L93]).  For AWS it queries the URL 
> [http://169.254.169.254/latest/meta-data/instance-id] and uses the 
> returned metadata to conclude the instance runs on AWS.  That works for 
> AWS, but OpenStack-based clouds expose the same EC2-compatible metadata, 
> so the URL is reachable there as well and cannot distinguish AWS from 
> OpenStack-based clouds.  This causes the 
> "HybridClockTest.TimeSourceAutoSelection" test case to fail: the test uses 
> the URL above to detect the cloud the instance is running on and then 
> tries to reach that cloud's dedicated NTP service.  On AWS the dedicated 
> NTP service is at "169.254.169.123", but OpenStack-based clouds have no 
> such service, so the test fails on an instance in an OpenStack-based cloud 
> because the cloud detector assumes it is an AWS instance and tries to 
> access "169.254.169.123".
>  
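
For reference, one possible way to tell the two environments apart from the 
command line (a sketch of the idea only, not necessarily the approach taken 
in the eventual fix):

{noformat}
# Both AWS and OpenStack answer the EC2-compatible endpoint:
$ curl -s http://169.254.169.254/latest/meta-data/instance-id
# Only OpenStack also serves its native metadata tree, so probing it first
# can distinguish the two (200 on OpenStack, 404 on AWS):
$ curl -s -o /dev/null -w '%{http_code}\n' http://169.254.169.254/openstack
{noformat}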



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3067) Inexplict cloud detection for AWS and OpenStack based cloud by querying metadata

2020-03-18 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3067:

Code Review: http://gerrit.cloudera.org:8080/15488

[~seanlau], could you verify that the fix published at 
http://gerrit.cloudera.org:8080/15488 works as expected?

> Inexplict cloud detection for AWS and OpenStack based cloud by querying 
> metadata
> 
>
> Key: KUDU-3067
> URL: https://issues.apache.org/jira/browse/KUDU-3067
> Project: Kudu
>  Issue Type: Bug
>Reporter: liusheng
>Assignee: Alexey Serbin
>Priority: Major
>
> The cloud detector checks which cloud provider an instance is running on 
> (see [here|#L59-L93]).  For AWS it queries the URL 
> [http://169.254.169.254/latest/meta-data/instance-id] and uses the 
> returned metadata to conclude the instance runs on AWS.  That works for 
> AWS, but OpenStack-based clouds expose the same EC2-compatible metadata, 
> so the URL is reachable there as well and cannot distinguish AWS from 
> OpenStack-based clouds.  This causes the 
> "HybridClockTest.TimeSourceAutoSelection" test case to fail: the test uses 
> the URL above to detect the cloud the instance is running on and then 
> tries to reach that cloud's dedicated NTP service.  On AWS the dedicated 
> NTP service is at "169.254.169.123", but OpenStack-based clouds have no 
> such service, so the test fails on an instance in an OpenStack-based cloud 
> because the cloud detector assumes it is an AWS instance and tries to 
> access "169.254.169.123".
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3067) Inexplict cloud detection for AWS and OpenStack based cloud by querying metadata

2020-03-18 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3067:

Status: In Review  (was: In Progress)

> Inexplict cloud detection for AWS and OpenStack based cloud by querying 
> metadata
> 
>
> Key: KUDU-3067
> URL: https://issues.apache.org/jira/browse/KUDU-3067
> Project: Kudu
>  Issue Type: Bug
>Reporter: liusheng
>Assignee: Alexey Serbin
>Priority: Major
>
> The cloud detector checks which cloud provider an instance is running on 
> (see [here|#L59-L93]).  For AWS it queries the URL 
> [http://169.254.169.254/latest/meta-data/instance-id] and uses the 
> returned metadata to conclude the instance runs on AWS.  That works for 
> AWS, but OpenStack-based clouds expose the same EC2-compatible metadata, 
> so the URL is reachable there as well and cannot distinguish AWS from 
> OpenStack-based clouds.  This causes the 
> "HybridClockTest.TimeSourceAutoSelection" test case to fail: the test uses 
> the URL above to detect the cloud the instance is running on and then 
> tries to reach that cloud's dedicated NTP service.  On AWS the dedicated 
> NTP service is at "169.254.169.123", but OpenStack-based clouds have no 
> such service, so the test fails on an instance in an OpenStack-based cloud 
> because the cloud detector assumes it is an AWS instance and tries to 
> access "169.254.169.123".
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3067) Inexplict cloud detection for AWS and OpenStack based cloud by querying metadata

2020-03-18 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061907#comment-17061907
 ] 

Alexey Serbin commented on KUDU-3067:
-

[~seanlau], the main thing that blocks me here is access to an OpenStack 
cloud instance.  If I put together a WIP patch for review at 
gerrit.cloudera.org, would you be able to verify it works for you?

Also, if you could run the following curl command on one of your instances and 
post back the output, it would be great:

{{curl -v http://169.254.169.254/latest/meta-data/instance-id}}

Also, do you have an account in the #kudu-general Slack channel?  Maybe we can 
sync up over Slack on that.

Thank you!

> Inexplict cloud detection for AWS and OpenStack based cloud by querying 
> metadata
> 
>
> Key: KUDU-3067
> URL: https://issues.apache.org/jira/browse/KUDU-3067
> Project: Kudu
>  Issue Type: Bug
>Reporter: liusheng
>Assignee: Alexey Serbin
>Priority: Major
>
> The cloud detector checks which cloud provider an instance is running on 
> (see [here|#L59-L93]).  For AWS it queries the URL 
> [http://169.254.169.254/latest/meta-data/instance-id] and uses the 
> returned metadata to conclude the instance runs on AWS.  That works for 
> AWS, but OpenStack-based clouds expose the same EC2-compatible metadata, 
> so the URL is reachable there as well and cannot distinguish AWS from 
> OpenStack-based clouds.  This causes the 
> "HybridClockTest.TimeSourceAutoSelection" test case to fail: the test uses 
> the URL above to detect the cloud the instance is running on and then 
> tries to reach that cloud's dedicated NTP service.  On AWS the dedicated 
> NTP service is at "169.254.169.123", but OpenStack-based clouds have no 
> such service, so the test fails on an instance in an OpenStack-based cloud 
> because the cloud detector assumes it is an AWS instance and tries to 
> access "169.254.169.123".
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3067) Inexplict cloud detection for AWS and OpenStack based cloud by querying metadata

2020-03-18 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3067:

Status: Open  (was: In Review)

> Inexplict cloud detection for AWS and OpenStack based cloud by querying 
> metadata
> 
>
> Key: KUDU-3067
> URL: https://issues.apache.org/jira/browse/KUDU-3067
> Project: Kudu
>  Issue Type: Bug
>Reporter: liusheng
>Assignee: Alexey Serbin
>Priority: Major
>
> The cloud detector checks which cloud provider an instance is running on 
> (see [here|#L59-L93]).  For AWS it queries the URL 
> [http://169.254.169.254/latest/meta-data/instance-id] and uses the 
> returned metadata to conclude the instance runs on AWS.  That works for 
> AWS, but OpenStack-based clouds expose the same EC2-compatible metadata, 
> so the URL is reachable there as well and cannot distinguish AWS from 
> OpenStack-based clouds.  This causes the 
> "HybridClockTest.TimeSourceAutoSelection" test case to fail: the test uses 
> the URL above to detect the cloud the instance is running on and then 
> tries to reach that cloud's dedicated NTP service.  On AWS the dedicated 
> NTP service is at "169.254.169.123", but OpenStack-based clouds have no 
> such service, so the test fails on an instance in an OpenStack-based cloud 
> because the cloud detector assumes it is an AWS instance and tries to 
> access "169.254.169.123".
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3067) Inexplict cloud detection for AWS and OpenStack based cloud by querying metadata

2020-03-18 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061869#comment-17061869
 ] 

Alexey Serbin commented on KUDU-3067:
-

Hi [~seanlau],

Sure -- looking.

> Inexplict cloud detection for AWS and OpenStack based cloud by querying 
> metadata
> 
>
> Key: KUDU-3067
> URL: https://issues.apache.org/jira/browse/KUDU-3067
> Project: Kudu
>  Issue Type: Bug
>Reporter: liusheng
>Assignee: Alexey Serbin
>Priority: Major
>
> The cloud detector checks which cloud provider an instance is running on 
> (see [here|#L59-L93]).  For AWS it queries the URL 
> [http://169.254.169.254/latest/meta-data/instance-id] and uses the 
> returned metadata to conclude the instance runs on AWS.  That works for 
> AWS, but OpenStack-based clouds expose the same EC2-compatible metadata, 
> so the URL is reachable there as well and cannot distinguish AWS from 
> OpenStack-based clouds.  This causes the 
> "HybridClockTest.TimeSourceAutoSelection" test case to fail: the test uses 
> the URL above to detect the cloud the instance is running on and then 
> tries to reach that cloud's dedicated NTP service.  On AWS the dedicated 
> NTP service is at "169.254.169.123", but OpenStack-based clouds have no 
> such service, so the test fails on an instance in an OpenStack-based cloud 
> because the cloud detector assumes it is an AWS instance and tries to 
> access "169.254.169.123".
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3067) Inexplict cloud detection for AWS and OpenStack based cloud by querying metadata

2020-03-18 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3067:

Status: In Review  (was: Open)

> Inexplict cloud detection for AWS and OpenStack based cloud by querying 
> metadata
> 
>
> Key: KUDU-3067
> URL: https://issues.apache.org/jira/browse/KUDU-3067
> Project: Kudu
>  Issue Type: Bug
>Reporter: liusheng
>Assignee: Alexey Serbin
>Priority: Major
>
> The cloud detector checks which cloud provider an instance is running on 
> (see [here|#L59-L93]).  For AWS it queries the URL 
> [http://169.254.169.254/latest/meta-data/instance-id] and uses the 
> returned metadata to conclude the instance runs on AWS.  That works for 
> AWS, but OpenStack-based clouds expose the same EC2-compatible metadata, 
> so the URL is reachable there as well and cannot distinguish AWS from 
> OpenStack-based clouds.  This causes the 
> "HybridClockTest.TimeSourceAutoSelection" test case to fail: the test uses 
> the URL above to detect the cloud the instance is running on and then 
> tries to reach that cloud's dedicated NTP service.  On AWS the dedicated 
> NTP service is at "169.254.169.123", but OpenStack-based clouds have no 
> such service, so the test fails on an instance in an OpenStack-based cloud 
> because the cloud detector assumes it is an AWS instance and tries to 
> access "169.254.169.123".
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3075) SubprocessServerTest.TestTimeoutWhileQueueingCalls sometimes fails

2020-03-13 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3075:
---

 Summary: SubprocessServerTest.TestTimeoutWhileQueueingCalls 
sometimes fails
 Key: KUDU-3075
 URL: https://issues.apache.org/jira/browse/KUDU-3075
 Project: Kudu
  Issue Type: Bug
  Components: test
Affects Versions: 1.12.0
Reporter: Alexey Serbin
 Attachments: subprocess_server-test.txt.xz

The test scenario sometimes fails like below (at least in TSAN builds):

{noformat}
W0314 03:52:48.979025  3014 server.h:119] failed to send request: End of file: 
unable to send message: Other end of pipe was closed 
src/kudu/subprocess/subprocess_server-test.cc:233: Failure
Value of: has_timeout_when_queueing 
  Actual: false 
Expected: true  
expected at least one timeout   
{noformat}

The log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3073) BuiltinNtpWithMiniChronydTest.SyncAndUnsyncReferenceServers sometimes fails

2020-03-12 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3073:
---

 Summary: 
BuiltinNtpWithMiniChronydTest.SyncAndUnsyncReferenceServers sometimes fails
 Key: KUDU-3073
 URL: https://issues.apache.org/jira/browse/KUDU-3073
 Project: Kudu
  Issue Type: Bug
Affects Versions: 1.12.0
Reporter: Alexey Serbin
 Attachments: ntp-test.txt.xz

{noformat}
src/kudu/clock/ntp-test.cc:478: Failure
Value of: s.IsRuntimeError()
  Actual: false 
Expected: true  
OK  
src/kudu/clock/ntp-test.cc:595: Failure
Expected: CheckNoNtpSource(sync_servers_refs) doesn't generate new fatal 
failures in the current thread.
  Actual: it does. 
{noformat}

The log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KUDU-2780) Rebalance Kudu cluster in background

2020-03-06 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reassigned KUDU-2780:
---

 Code Review: https://gerrit.cloudera.org/#/c/14177/
 Component/s: master
Target Version/s: 1.12.0
Assignee: Hannah Nguyen  (was: Alexey Serbin)

> Rebalance Kudu cluster in background
> 
>
> Key: KUDU-2780
> URL: https://issues.apache.org/jira/browse/KUDU-2780
> Project: Kudu
>  Issue Type: Improvement
>  Components: master
>Reporter: Alexey Serbin
>Assignee: Hannah Nguyen
>Priority: Major
>  Labels: roadmap-candidate
>
> With the introduction of the `kudu cluster rebalance` CLI tool it's possible 
> to balance the distribution of tablet replicas in a Kudu cluster.  However, 
> that tool has to be run manually or via an external scheduler (e.g. cron).
> It would be nice if Kudu would track and correct imbalances of replica 
> distribution automatically.
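
For example, until background rebalancing exists, the tool can be driven by 
cron (master addresses below are placeholders):

{noformat}
# Run the rebalancer hourly against the cluster's masters
0 * * * * kudu cluster rebalance master-1:7051,master-2:7051,master-3:7051
{noformat}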



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KUDU-2780) Rebalance Kudu cluster in background

2020-03-06 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reassigned KUDU-2780:
---

Assignee: Alexey Serbin  (was: Hannah Nguyen)

> Rebalance Kudu cluster in background
> 
>
> Key: KUDU-2780
> URL: https://issues.apache.org/jira/browse/KUDU-2780
> Project: Kudu
>  Issue Type: Improvement
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
>  Labels: roadmap-candidate
>
> With the introduction of the `kudu cluster rebalance` CLI tool it's possible 
> to balance the distribution of tablet replicas in a Kudu cluster.  However, 
> that tool has to be run manually or via an external scheduler (e.g. cron).
> It would be nice if Kudu would track and correct imbalances of replica 
> distribution automatically.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KUDU-3067) Inexplict cloud detection for AWS and OpenStack based cloud by querying metadata

2020-03-05 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reassigned KUDU-3067:
---

Assignee: Alexey Serbin

> Inexplict cloud detection for AWS and OpenStack based cloud by querying 
> metadata
> 
>
> Key: KUDU-3067
> URL: https://issues.apache.org/jira/browse/KUDU-3067
> Project: Kudu
>  Issue Type: Bug
>Reporter: liusheng
>Assignee: Alexey Serbin
>Priority: Major
>
> The cloud detector checks which cloud provider an instance is running on 
> (see [here|#L59-L93]).  For AWS it queries the URL 
> [http://169.254.169.254/latest/meta-data/instance-id] and uses the 
> returned metadata to conclude the instance runs on AWS.  That works for 
> AWS, but OpenStack-based clouds expose the same EC2-compatible metadata, 
> so the URL is reachable there as well and cannot distinguish AWS from 
> OpenStack-based clouds.  This causes the 
> "HybridClockTest.TimeSourceAutoSelection" test case to fail: the test uses 
> the URL above to detect the cloud the instance is running on and then 
> tries to reach that cloud's dedicated NTP service.  On AWS the dedicated 
> NTP service is at "169.254.169.123", but OpenStack-based clouds have no 
> such service, so the test fails on an instance in an OpenStack-based cloud 
> because the cloud detector assumes it is an AWS instance and tries to 
> access "169.254.169.123".
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3065) MasterReplicationAndRpcSizeLimitTest.TabletReports sometimes fails

2020-03-02 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3065:
---

 Summary: MasterReplicationAndRpcSizeLimitTest.TabletReports 
sometimes fails
 Key: KUDU-3065
 URL: https://issues.apache.org/jira/browse/KUDU-3065
 Project: Kudu
  Issue Type: Bug
  Components: test
Affects Versions: 1.12.0
Reporter: Alexey Serbin
 Attachments: master_replication-itest.txt.xz

The {{MasterReplicationAndRpcSizeLimitTest.TabletReports}} scenario sometimes 
fails with an error like below:

{noformat}
F0228 19:57:29.566195   236 test_workload.cc:330] Timed out: Timed out waiting 
for Table Creation
*** Check failure stack trace: ***
*** Aborted at 1582919849 (unix time) try "date -d @1582919849" if you are 
using GNU date ***
PC: @ 0x7f62a6483c37 gsignal
*** SIGABRT (@0x3e800ec) received by PID 236 (TID 0x7f62ab9ff8c0) from PID 
236; stack trace: ***
@ 0x7f62a94ca330 (unknown) at ??:0
@ 0x7f62a6483c37 gsignal at ??:0
@ 0x7f62a6487028 abort at ??:0
@ 0x7f62a74f2e09 google::logging_fail() at ??:0
@ 0x7f62a74f462d google::LogMessage::Fail() at ??:0
@ 0x7f62a74f664c google::LogMessage::SendToLog() at ??:0
@ 0x7f62a74f4189 google::LogMessage::Flush() at ??:0
@ 0x7f62a74f6fdf google::LogMessageFatal::~LogMessageFatal() at ??:0
@ 0x7f62ab5bf008 kudu::TestWorkload::Setup() at ??:0
@   0x431230 
kudu::master::MasterReplicationAndRpcSizeLimitTest_TabletReports_Test::TestBody()
 at src/kudu/integration-tests/master_replication-itest.cc:597
@ 0x7f62a851ab89 
testing::internal::HandleExceptionsInMethodIfSupported<>() at ??:0
@ 0x7f62a850b68f testing::Test::Run() at ??:0
@ 0x7f62a850b74d testing::TestInfo::Run() at ??:0
@ 0x7f62a850b865 testing::TestCase::Run() at ??:0
@ 0x7f62a850bb28 testing::internal::UnitTestImpl::RunAllTests() at ??:0
@ 0x7f62a850bdc9 testing::UnitTest::Run() at ??:0
@ 0x7f62ab297cf3 RUN_ALL_TESTS() at ??:0
@ 0x7f62ab295cab main at ??:0
@ 0x7f62a646ef45 __libc_start_main at ??:0
@   0x42a999 (unknown) at ??:?
W0228 19:57:29.694170  6801 catalog_manager.cc:4593] T 
 P d48d9641d5694fa783f13d0dce1ad5a9: Tablet 
0113c6c5ccda4639a20e6cda5ce67acf (table test-workload 
[id=1f40b8e5e9454964ad4380766e8e2382]) was not created within the allowed 
timeout. Replacing with a new tablet 85e9b58f383a48cd807f7a49c6190867
{noformat}

The log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-2573) Fully support Chrony in place of NTP

2020-02-28 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17048012#comment-17048012
 ] 

Alexey Serbin commented on KUDU-2573:
-

After running several Kudu clusters on machines whose local clocks are 
synchronized with {{chronyd}}, I think we can declare it safe to use 
{{chrony}} version 3.4 and newer instead of {{ntp}} on Kudu nodes.

The important configuration option to be turned on is {{rtcsync}} as described 
[here|https://issues.apache.org/jira/browse/KUDU-2573?focusedCommentId=17029145&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17029145].

I updated the NTP troubleshooting docs: https://gerrit.cloudera.org/#/c/15320/
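
As an illustration, a minimal {{chrony.conf}} fragment with that option 
enabled (the server pool below is a placeholder):

{noformat}
# /etc/chrony.conf (fragment)
pool pool.ntp.org iburst
# Enable kernel synchronization of the clock.  Without rtcsync chronyd
# does not mark the kernel clock as synchronized, and Kudu's check of
# the system clock's sync status fails.
rtcsync
{noformat}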

The rest of the docs (e.g., building Kudu) still need to be updated before 
changing the resolution of this JIRA item.

> Fully support Chrony in place of NTP
> 
>
> Key: KUDU-2573
> URL: https://issues.apache.org/jira/browse/KUDU-2573
> Project: Kudu
>  Issue Type: New Feature
>  Components: clock, master, tserver
>Reporter: Grant Henke
>Assignee: Alexey Serbin
>Priority: Major
>  Labels: clock
>
> This is to track fully supporting Chrony in place of NTP. Given Chrony is the 
> default in RHEL7+, running Kudu with Chrony is likely to be more common. 
> The work should entail:
>  * identifying and fixing or documenting any differences or gaps
>  * removing the experimental warnings from the documentation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-2322) Leader spews logs when follower falls behind log GC

2020-02-27 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2322:

Fix Version/s: (was: 1.7.0)
   1.7.1

> Leader spews logs when follower falls behind log GC
> ---
>
> Key: KUDU-2322
> URL: https://issues.apache.org/jira/browse/KUDU-2322
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 1.7.0
>Reporter: Todd Lipcon
>Assignee: Alexey Serbin
>Priority: Critical
> Fix For: 1.8.0, 1.7.1
>
>
> I'm running a YCSB-based write stress test and found that one of the 
> followers fell behind enough that its logs got GCed by the leader. At this 
> point, the leader started logging about 100 messages per second indicating 
> that it could not obtain a request for this peer.
> I believe this is a regression since 1.6, since before 3-4-3 replication we 
> would have evicted the replica as soon as it fell behind GC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-2342) Non-voter replicas can be promoted and get stuck

2020-02-27 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2342:

Fix Version/s: 1.8.0

> Non-voter replicas can be promoted and get stuck
> 
>
> Key: KUDU-2342
> URL: https://issues.apache.org/jira/browse/KUDU-2342
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.7.0
>Reporter: Mostafa Mokhtar
>Assignee: Alexey Serbin
>Priority: Blocker
>  Labels: scalability
> Fix For: 1.8.0, 1.7.1
>
> Attachments: Impala query profile.txt, tablet-info.html
>
>
> While loading TPCH 30TB on a 129-node cluster via Impala, a write operation 
> failed with:
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to 
> write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 
> attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 
> (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 
> 180.000s (SENT)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-2342) Non-voter replicas can be promoted and get stuck

2020-02-27 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2342:

Fix Version/s: (was: 1.7.0)
   1.7.1

> Non-voter replicas can be promoted and get stuck
> 
>
> Key: KUDU-2342
> URL: https://issues.apache.org/jira/browse/KUDU-2342
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.7.0
>Reporter: Mostafa Mokhtar
>Assignee: Alexey Serbin
>Priority: Blocker
>  Labels: scalability
> Fix For: 1.7.1
>
> Attachments: Impala query profile.txt, tablet-info.html
>
>
> While loading TPCH 30TB on a 129-node cluster via Impala, a write operation 
> failed with:
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to 
> write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 
> attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 
> (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 
> 180.000s (SENT)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3062) built-in NTP client: create a mock NTP server

2020-02-25 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3062:
---

 Summary: built-in NTP client: create a mock NTP server
 Key: KUDU-3062
 URL: https://issues.apache.org/jira/browse/KUDU-3062
 Project: Kudu
  Issue Type: Sub-task
Reporter: Alexey Serbin


To test the functionality of NTP packet sanitisation performed by the built-in 
NTP client, it's necessary to create a mock NTP server to simulate various 
corner cases.

The motivation for this: running tests against chronyd as a test NTP server is 
great, but we need to make sure the client works with other NTP servers which 
are available in the wild.  For more context, see 
https://gerrit.cloudera.org/#/c/15274/
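
A minimal sketch of what such a mock could look like: a plain UDP responder 
that always claims to be a healthy stratum-1 server (illustrative only, not 
the actual implementation; tests would inject corner cases such as a bogus 
stratum, zeroed timestamps, or an alarming leap indicator):

{code:cpp}
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#include <cstdint>
#include <cstring>
#include <ctime>

// Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
constexpr uint32_t kNtpUnixDelta = 2208988800u;

int main() {
  int fd = socket(AF_INET, SOCK_DGRAM, 0);
  sockaddr_in addr = {};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = htons(12345);  // arbitrary port for tests
  if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
    return 1;
  }
  while (true) {
    uint8_t req[48];
    sockaddr_in peer = {};
    socklen_t peer_len = sizeof(peer);
    const ssize_t n = recvfrom(fd, req, sizeof(req), 0,
                               reinterpret_cast<sockaddr*>(&peer), &peer_len);
    if (n < 48) continue;  // ignore short/malformed packets
    uint8_t resp[48] = {};
    resp[0] = (0 << 6) | (4 << 3) | 4;  // LI=0, VN=4, Mode=4 (server)
    resp[1] = 1;                        // stratum 1
    memcpy(&resp[24], &req[40], 8);     // originate := client's transmit
    const uint32_t now =
        htonl(static_cast<uint32_t>(time(nullptr)) + kNtpUnixDelta);
    memcpy(&resp[32], &now, 4);         // receive timestamp (seconds part)
    memcpy(&resp[40], &now, 4);         // transmit timestamp (seconds part)
    sendto(fd, resp, sizeof(resp), 0,
           reinterpret_cast<sockaddr*>(&peer), peer_len);
  }
}
{code}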



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KUDU-3053) Make 'kudu cluster ksck' check and report on the time source settings in a Kudu cluster

2020-02-24 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reassigned KUDU-3053:
---

Assignee: Alexey Serbin

> Make 'kudu cluster ksck' check and report on the time source settings in a 
> Kudu cluster
> ---
>
> Key: KUDU-3053
> URL: https://issues.apache.org/jira/browse/KUDU-3053
> Project: Kudu
>  Issue Type: Improvement
>  Components: CLI
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
>  Labels: clock
>
> Since the time source for the hybrid clock is now configurable (and even 
> auto-configurable), it's important to ensure that the time source is 
> configured consistently across the cluster.  At least, there should be a way 
> to spot discrepancies in the time source across all Kudu servers 
> (masters/tservers).
> Let's add extra functionality into the {{kudu cluster ksck}} logic to verify 
> that the time source is configured uniformly across the Kudu cluster.
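
For example, the new check could surface through the usual ksck invocation 
(master addresses and the warning below are hypothetical):

{noformat}
$ kudu cluster ksck master-1:7051,master-2:7051,master-3:7051
...
WARNING: tserver kudu-ts03 uses time source 'system' while the other
servers use 'builtin'
{noformat}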



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KUDU-3048) Add time/clock synchronization metrics

2020-02-21 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-3048.
-
Fix Version/s: 1.12.0
   Resolution: Fixed

> Add time/clock synchronization metrics
> --
>
> Key: KUDU-3048
> URL: https://issues.apache.org/jira/browse/KUDU-3048
> Project: Kudu
>  Issue Type: Improvement
>  Components: clock, master, tserver
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
>  Labels: clock
> Fix For: 1.12.0
>
>
> For better visibility, it would be great to add metrics reflecting time/clock 
> synchronization parameters:
> * the stats on the max_error sampled while reading the underlying clock
> * the stats on time intervals when the underlying clock was extrapolated 
> instead of using the actual readings: number of such intervals and stats on 
> the interval duration
> * whether hybrid clock timestamps are generated using interpolated clock 
> readings instead of real ones
> * if using the {{built-in}} time source:
> ** difference between tracked true time and local wallclock
> ** most recently computed true time
> ** the stats on the maximum error of the computed true time
> As for the rationale behind the new metrics:
> * max_error shows how far the clock is from the true time; maybe it's 
> time to use another set of NTP servers or instead increase the 
> {{\-\-max_clock_sync_error_usec}} flag value
> * presence of extrapolation intervals for the hybrid clock signals 
> periods of non-availability of NTP servers; a possible action would be 
> re-visiting the set of NTP servers
> * if hybrid timestamps are being extrapolated for some time, Kudu masters and 
> tablet servers might crash if the clock error eventually goes beyond the 
> configured threshold: it's time to start troubleshooting the issue to avoid 
> possible non-availability of the cluster
> * the delta between true time tracked by the built-in NTP client and the 
> local system clock is useful to understand how the log timestamps are related 
> to the HybridClock timestamps (in case of using the built-in NTP client those 
> might diverge)
> * the stats on true time computed by the built-in NTP client give insights on 
> the quality of the reference NTP servers
> The new metrics can be used for monitoring and alerting, allowing for 
> pro-active maintenance of a Kudu cluster.
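
For example, a monitoring system could poll such metrics through the 
existing web server endpoint (host and port below are placeholders; the 
{{/metrics}} endpoint accepts a substring filter):

{noformat}
# Pull only clock-related metrics from a tablet server
$ curl -s 'http://tserver-1:8050/metrics?metrics=clock'
{noformat}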



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3048) Add time/clock synchronization metrics

2020-02-21 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3048:

Description: 
For better visibility, it would be great to add metrics reflecting time/clock 
synchronization parameters:
* the stats on the max_error sampled while reading the underlying clock
* the stats on time intervals when the underlying clock was extrapolated 
instead of using the actual readings: number of such intervals and stats on the 
interval duration
* whether hybrid clock timestamps are generated using interpolated clock 
readings instead of real ones
* if using the {{built-in}} time source:
** difference between tracked true time and local wallclock
** most recently computed true time
** the stats on the maximum error of the computed true time

As for the rationale behind the new metrics:
* max_error shows how far the clock is from the true time; maybe it's time 
to use another set of NTP servers or instead increase the 
{{\-\-max_clock_sync_error_usec}} flag value
* presence of extrapolation intervals for the hybrid clock signals 
periods of non-availability of NTP servers; a possible action would be 
re-visiting the set of NTP servers
* if hybrid timestamps are being extrapolated for some time, Kudu masters and 
tablet servers might crash if the clock error eventually goes beyond the 
configured threshold: it's time to start troubleshooting the issue to avoid 
possible non-availability of the cluster
* the delta between true time tracked by the built-in NTP client and the local 
system clock is useful to understand how the log timestamps are related to the 
HybridClock timestamps (in case of using the built-in NTP client those might 
diverge)
* the stats on true time computed by the built-in NTP client give insights on 
the quality of the reference NTP servers


The new metrics can be used for monitoring and alerting, allowing for 
pro-active maintenance of a Kudu cluster.

  was:
For better visibility, it would be great to add metrics reflecting time/clock 
synchronization parameters:
* the stats on the max_error sampled while reading the underlying clock
* the stats on time intervals when the underlying clock was extrapolated 
instead of using the actual readings: number of such intervals and stats on the 
interval duration
* whether hybrid clock timestamps are generated using interpolated clock 
readings instead of real ones
* if using the {{built-in}} time source:
** difference between tracked true time and local wallclock
** most recently computed true time
** the stats on the maximum error of the computed true time

As for the rationale behind the new metrics:
* max_error shows how far the clock is from the true time; maybe it's time 
to use another set of NTP servers or instead increase the 
{{\-\-max_clock_sync_error_usec}} flag value
* presence of extrapolation intervals for the hybrid clock signals 
periods of non-availability of NTP servers; a possible action would be 
re-visiting the set of NTP servers
* if hybrid timestamps are being extrapolated for some time, Kudu masters and 
tablet servers might crash if the clock error eventually goes beyond the 
configured threshold: it's time to start troubleshooting the issue to avoid 
possible non-availability of the cluster

The new metrics can be used for monitoring and alerting, allowing for 
pro-active maintenance of a Kudu cluster.


> Add time/clock synchronization metrics
> --
>
> Key: KUDU-3048
> URL: https://issues.apache.org/jira/browse/KUDU-3048
> Project: Kudu
>  Issue Type: Improvement
>  Components: clock, master, tserver
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
>  Labels: clock
>
> For better visibility, it would be great to add metrics reflecting time/clock 
> synchronization parameters:
> * the stats on the max_error sampled while reading the underlying clock
> * the stats on time intervals when the underlying clock was extrapolated 
> instead of using the actual readings: number of such intervals and stats on 
> the interval duration
> * whether hybrid clock timestamps are generated using interpolated clock 
> readings instead of real ones
> * if using the {{built-in}} time source:
> ** difference between tracked true time and local wallclock
> ** most recently computed true time
> ** the stats on the maximum error of the computed true time
> As for the rationale behind the new metrics:
> * max_error shows how far the clock is from the true time; maybe it's 
> time to use another set of NTP servers or instead increase the 
> {{\-\-max_clock_sync_error_usec}} flag value
> * presence of extrapolation intervals for the hybrid clock signals 
> periods of non-availability of NTP servers; a possible action would be 

[jira] [Commented] (KUDU-3058) RollingRestartITest.TestWorkloads sometimes fails

2020-02-20 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041543#comment-17041543
 ] 

Alexey Serbin commented on KUDU-3058:
-

Attaching one more failure log from an ASAN pre-commit run. 
[^maintenance_mode-itest.00.txt.xz] 

> RollingRestartITest.TestWorkloads sometimes fails
> -
>
> Key: KUDU-3058
> URL: https://issues.apache.org/jira/browse/KUDU-3058
> Project: Kudu
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.12.0
>Reporter: Alexey Serbin
>Priority: Minor
> Attachments: maintenance_mode-itest.0.txt, 
> maintenance_mode-itest.0.txt.xz, maintenance_mode-itest.00.txt.xz
>
>
> The scenario sometimes fails with an error like below:
> {noformat}
> 0219 20:46:16.414293 (+11us) service_pool.cc:221] Handling call
> 0219 20:46:18.247943 (+1833650us) inbound_call.cc:162] Queueing success 
> response
> Metrics: {}
> I0219 20:46:19.529562 31739 ts_manager.cc:267] Unset tserver state for 
> f0df965344df403a86c95138c5e0f
> 771 from MAINTENANCE_MODE
> I0219 20:46:19.535373 31739 ts_manager.cc:267] Unset tserver state for 
> c63029ba5b4148ab90e9d437e9487
> c76 from MAINTENANCE_MODE
> I0219 20:46:19.538385 31739 ts_manager.cc:267] Unset tserver state for 
> 4216503399e4476694010b88d3ab8cc5 from MAINTENANCE_MODE
> I0219 20:46:19.542889 31739 ts_manager.cc:267] Unset tserver state for 
> 5080d97cb158428d8f86ab6797dd8149 from MAINTENANCE_MODE
> /data0/jenkins/workspace/kudu-pre-commit-unittest-RELEASE/src/kudu/integration-tests/maintenance_mode-itest.cc:750:
>  Failure
> Value of: s.ok()
>   Actual: true
> Expected: false
> /data0/jenkins/workspace/kudu-pre-commit-unittest-RELEASE/src/kudu/util/test_util.cc:345:
>  Failure
> Failed
> Timed out waiting for assertion to pass.
> {noformat}
> The log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3058) RollingRestartITest.TestWorkloads sometimes fails

2020-02-20 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3058:

Attachment: maintenance_mode-itest.00.txt.xz

> RollingRestartITest.TestWorkloads sometimes fails
> -
>
> Key: KUDU-3058
> URL: https://issues.apache.org/jira/browse/KUDU-3058
> Project: Kudu
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.12.0
>Reporter: Alexey Serbin
>Priority: Minor
> Attachments: maintenance_mode-itest.0.txt, 
> maintenance_mode-itest.0.txt.xz, maintenance_mode-itest.00.txt.xz
>
>
> The scenario sometimes fails with an error like below:
> {noformat}
> 0219 20:46:16.414293 (+11us) service_pool.cc:221] Handling call
> 0219 20:46:18.247943 (+1833650us) inbound_call.cc:162] Queueing success 
> response
> Metrics: {}
> I0219 20:46:19.529562 31739 ts_manager.cc:267] Unset tserver state for 
> f0df965344df403a86c95138c5e0f
> 771 from MAINTENANCE_MODE
> I0219 20:46:19.535373 31739 ts_manager.cc:267] Unset tserver state for 
> c63029ba5b4148ab90e9d437e9487
> c76 from MAINTENANCE_MODE
> I0219 20:46:19.538385 31739 ts_manager.cc:267] Unset tserver state for 
> 4216503399e4476694010b88d3ab8cc5 from MAINTENANCE_MODE
> I0219 20:46:19.542889 31739 ts_manager.cc:267] Unset tserver state for 
> 5080d97cb158428d8f86ab6797dd8149 from MAINTENANCE_MODE
> /data0/jenkins/workspace/kudu-pre-commit-unittest-RELEASE/src/kudu/integration-tests/maintenance_mode-itest.cc:750:
>  Failure
> Value of: s.ok()
>   Actual: true
> Expected: false
> /data0/jenkins/workspace/kudu-pre-commit-unittest-RELEASE/src/kudu/util/test_util.cc:345:
>  Failure
> Failed
> Timed out waiting for assertion to pass.
> {noformat}
> The log is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

