[jira] [Created] (KUDU-3154) RangerClientTestBase.TestLogging sometimes fails
Alexey Serbin created KUDU-3154: --- Summary: RangerClientTestBase.TestLogging sometimes fails Key: KUDU-3154 URL: https://issues.apache.org/jira/browse/KUDU-3154 Project: Kudu Issue Type: Bug Components: ranger, test Affects Versions: 1.13.0 Reporter: Alexey Serbin Attachments: ranger_client-test.txt.xz The {{RangerClientTestBase.TestLogging}} scenario of {{ranger_client-test}} sometimes fails (in all types of builds) with an error message like the one below: {noformat} src/kudu/ranger/ranger_client-test.cc:398: Failure Failed Bad status: Timed out: timed out while in flight I0620 07:06:02.907177 1140 server.cc:247] Received an EOF from the subprocess I0620 07:06:02.910923 1137 server.cc:317] get failed, inbound queue shut down: Aborted: I0620 07:06:02.910964 1141 server.cc:380] outbound queue shut down: Aborted: I0620 07:06:02.910995 1138 server.cc:317] get failed, inbound queue shut down: Aborted: I0620 07:06:02.910984 1139 server.cc:317] get failed, inbound queue shut down: Aborted: {noformat} The log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows
[ https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-2727: Component/s: (was: perf) tserver consensus > Contention on the Raft consensus lock can cause tablet service queue overflows > -- > > Key: KUDU-2727 > URL: https://issues.apache.org/jira/browse/KUDU-2727 > Project: Kudu > Issue Type: Improvement > Components: consensus, tserver >Reporter: William Berkeley >Assignee: Alexey Serbin >Priority: Major > Labels: performance, scalability > > Here's stacks illustrating the phenomenon: > {noformat} > tids=[2201] > 0x379ba0f710 >0x1fb951a base::internal::SpinLockDelay() >0x1fb93b7 base::SpinLock::SlowLock() > 0xb4e68e kudu::consensus::Peer::SignalRequest() > 0xb9c0df kudu::consensus::PeerManager::SignalRequest() > 0xb8c178 kudu::consensus::RaftConsensus::Replicate() > 0xaab816 kudu::tablet::TransactionDriver::Prepare() > 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask() >0x1fa37ed kudu::ThreadPool::DispatchThread() >0x1f9c2a1 kudu::Thread::SuperviseThread() > 0x379ba079d1 start_thread > 0x379b6e88fd clone > tids=[4515] > 0x379ba0f710 >0x1fb951a base::internal::SpinLockDelay() >0x1fb93b7 base::SpinLock::SlowLock() > 0xb74c60 kudu::consensus::RaftConsensus::NotifyCommitIndex() > 0xb59307 kudu::consensus::PeerMessageQueue::NotifyObserversTask() > 0xb54058 > _ZN4kudu8internal7InvokerILi2ENS0_9BindStateINS0_15RunnableAdapterIMNS_9consensus16PeerMessageQueueEFvRKSt8functionIFvPNS4_24PeerMessageQueueObserverEEFvPS5_SC_EFvNS0_17UnretainedWrapperIS5_EEZNS5_34NotifyObserversOfCommitIndexChangeElEUlS8_E_EEESH_E3RunEPNS0_13BindStateBaseE >0x1fa37ed kudu::ThreadPool::DispatchThread() >0x1f9c2a1 kudu::Thread::SuperviseThread() > 0x379ba079d1 start_thread > 0x379b6e88fd clone > tids=[22185,22194,22193,22188,22187,22186] > 0x379ba0f710 >0x1fb951a base::internal::SpinLockDelay() >0x1fb93b7 base::SpinLock::SlowLock() > 0xb8bff8 > kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm() > 0xaaaef9 kudu::tablet::TransactionDriver::ExecuteAsync() > 0xaa3742 kudu::tablet::TabletReplica::SubmitWrite() > 0x92812d kudu::tserver::TabletServiceImpl::Write() >0x1e28f3c kudu::rpc::GeneratedServiceIf::Handle() >0x1e2986a kudu::rpc::ServicePool::RunThread() >0x1f9c2a1 kudu::Thread::SuperviseThread() > 0x379ba079d1 start_thread > 0x379b6e88fd clone > tids=[22192,22191] > 0x379ba0f710 >0x1fb951a base::internal::SpinLockDelay() >0x1fb93b7 base::SpinLock::SlowLock() >0x1e13dec kudu::rpc::ResultTracker::TrackRpc() >0x1e28ef5 kudu::rpc::GeneratedServiceIf::Handle() >0x1e2986a kudu::rpc::ServicePool::RunThread() >0x1f9c2a1 kudu::Thread::SuperviseThread() > 0x379ba079d1 start_thread > 0x379b6e88fd clone > tids=[4426] > 0x379ba0f710 >0x206d3d0 >0x212fd25 google::protobuf::Message::SpaceUsedLong() >0x211dee4 > google::protobuf::internal::GeneratedMessageReflection::SpaceUsedLong() > 0xb6658e kudu::consensus::LogCache::AppendOperations() > 0xb5c539 kudu::consensus::PeerMessageQueue::AppendOperations() > 0xb5c7c7 kudu::consensus::PeerMessageQueue::AppendOperation() > 0xb7c675 > kudu::consensus::RaftConsensus::AppendNewRoundToQueueUnlocked() > 0xb8c147 kudu::consensus::RaftConsensus::Replicate() > 0xaab816 kudu::tablet::TransactionDriver::Prepare() > 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask() >0x1fa37ed kudu::ThreadPool::DispatchThread() >0x1f9c2a1 kudu::Thread::SuperviseThread() > 0x379ba079d1 start_thread > 0x379b6e88fd clone > {noformat} > {{kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm()}} needs to > take 
the lock to check the term and the Raft role. When many RPCs come in for > the same tablet, the contention can hog service threads and cause queue > overflows on busy systems. > Yugabyte switched their equivalent lock to an atomic that allows them to > read the term and role wait-free. -- This message was sent by Atlassian Jira (v8.3.4#803005)
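For context on the wait-free direction mentioned in the description, here is a minimal sketch (not Kudu's actual implementation; all names below are hypothetical) of packing the term and the Raft role into a single atomic word, so that hot-path checks like {{CheckLeadershipAndBindTerm()}} could read both without taking the consensus lock:

{code:cpp}
// Hypothetical sketch: publish term + role through one atomic word so
// readers on the Write RPC hot path never touch the consensus lock.
#include <atomic>
#include <cstdint>

enum class RaftRole : uint8_t { kLeader, kFollower, kLearner, kNonParticipant };

class AtomicTermAndRole {
 public:
  // Writers are rare (term/role transitions) and would still run under the
  // consensus lock; they publish the new state with one atomic store.
  // Assumes the term fits into 56 bits.
  void Set(int64_t term, RaftRole role) {
    state_.store((static_cast<uint64_t>(term) << 8) |
                     static_cast<uint64_t>(role),
                 std::memory_order_release);
  }

  // Readers are hot (every incoming write) and wait-free.
  bool IsLeaderAndGetTerm(int64_t* term) const {
    const uint64_t s = state_.load(std::memory_order_acquire);
    *term = static_cast<int64_t>(s >> 8);
    return static_cast<RaftRole>(s & 0xff) == RaftRole::kLeader;
  }

 private:
  std::atomic<uint64_t> state_{0};
};
{code}

Only the read side becomes lock-free here; term/role transitions would still serialize on the consensus lock.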
[jira] [Updated] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows
[ https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-2727: Status: In Review (was: In Progress) > Contention on the Raft consensus lock can cause tablet service queue overflows > -- > > Key: KUDU-2727 > URL: https://issues.apache.org/jira/browse/KUDU-2727 > Project: Kudu > Issue Type: Improvement > Components: perf >Reporter: William Berkeley >Assignee: Alexey Serbin >Priority: Major > Labels: performance, scalability -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows
[ https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin reassigned KUDU-2727: --- Assignee: Alexey Serbin > Contention on the Raft consensus lock can cause tablet service queue overflows > -- > > Key: KUDU-2727 > URL: https://issues.apache.org/jira/browse/KUDU-2727 > Project: Kudu > Issue Type: Improvement > Components: perf >Reporter: William Berkeley >Assignee: Alexey Serbin >Priority: Major > Labels: performance, scalability -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows
[ https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-2727: Labels: performance scalability (was: ) > Contention on the Raft consensus lock can cause tablet service queue overflows > -- > > Key: KUDU-2727 > URL: https://issues.apache.org/jira/browse/KUDU-2727 > Project: Kudu > Issue Type: Improvement > Components: perf >Reporter: William Berkeley >Priority: Major > Labels: performance, scalability -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows
[ https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-2727: Code Review: https://gerrit.cloudera.org/#/c/16034/ > Contention on the Raft consensus lock can cause tablet service queue overflows > -- > > Key: KUDU-2727 > URL: https://issues.apache.org/jira/browse/KUDU-2727 > Project: Kudu > Issue Type: Improvement > Components: perf >Reporter: William Berkeley >Assignee: Alexey Serbin >Priority: Major > Labels: performance, scalability -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3129) ToolTest.TestHmsList can timeout
[ https://issues.apache.org/jira/browse/KUDU-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136729#comment-17136729 ] Alexey Serbin commented on KUDU-3129: - The test also times out in case of RELEASE builds: the log is attached. [^kudu-tool-test.2.txt.xz] > ToolTest.TestHmsList can timeout > > > Key: KUDU-3129 > URL: https://issues.apache.org/jira/browse/KUDU-3129 > Project: Kudu > Issue Type: Bug > Components: hms, test >Affects Versions: 1.12.0 >Reporter: Andrew Wong >Priority: Major > Attachments: kudu-tool-test.2.txt, kudu-tool-test.2.txt.xz > > > When running in TSAN mode, the test timed out, spending 10 minutes not really > doing anything. It isn't obvious why, but ToolTest.TestHmsList can timeout, > appearing to hang while running the HMS tool. > {code} > I0521 22:31:49.436857 4601 catalog_manager.cc:1161] Initializing in-progress > tserver states... > I0521 22:31:49.446161 4606 hms_notification_log_listener.cc:228] Skipping > Hive Metastore notification log poll: Service unavailable: Catalog manager is > not initialized. State: Starting > I0521 22:31:49.839709 4488 heartbeater.cc:325] Connected to a master server > at 127.0.89.254:42487 > I0521 22:31:49.845547 4559 master_service.cc:295] Got heartbeat from unknown > tserver (permanent_uuid: "cf9e08c4271e4d9aa28b1aacbd630908" instance_seqno: > 1590100304311876) as {username='slave'} at 127.0.89.193:33867; Asking this > server to re-register. > I0521 22:31:49.846786 4488 heartbeater.cc:416] Registering TS with master... > I0521 22:31:49.847297 4488 heartbeater.cc:465] Master 127.0.89.254:42487 > requested a full tablet report, sending... > I0521 22:31:49.849771 4559 ts_manager.cc:191] Registered new tserver with > Master: cf9e08c4271e4d9aa28b1aacbd630908 (127.0.89.193:43527) > I0521 22:31:49.852535 359 external_mini_cluster.cc:699] 1 TS(s) registered > with all masters > W0521 22:32:23.142868 4545 debug-util.cc:402] Leaking SignalData structure > 0x7b080002b060 after lost signal to thread 4531 > W0521 22:32:23.14 4545 debug-util.cc:402] Leaking SignalData structure > 0x7b080002b780 after lost signal to thread 4591 > W0521 22:32:28.996440 4545 debug-util.cc:402] Leaking SignalData structure > 0x7b080002b740 after lost signal to thread 4531 > W0521 22:32:28.996966 4545 debug-util.cc:402] Leaking SignalData structure > 0x7b080002b520 after lost signal to thread 4591 > W0521 22:33:05.743249 4380 debug-util.cc:402] Leaking SignalData structure > 0x7b080002aae0 after lost signal to thread 4360 > W0521 22:33:05.743983 4380 debug-util.cc:402] Leaking SignalData structure > 0x7b080002af00 after lost signal to thread 4486 > I0521 22:33:49.594769 4549 maintenance_manager.cc:326] P > c3cc85c33a5447b2aa520019fe162966: Scheduling > FlushMRSOp(): perf score=0.033386 > I0521 22:33:49.637208 4548 maintenance_manager.cc:525] P > c3cc85c33a5447b2aa520019fe162966: > FlushMRSOp() complete. 
Timing: real 0.042s > user 0.032s sys 0.008s Metrics: > {"bytes_written":6485,"cfile_init":1,"dirs.queue_time_us":675,"dirs.run_cpu_time_us":237,"dirs.run_wall_time_us":997,"drs_written":1,"lbm_read_time_us":231,"lbm_reads_lt_1ms":4,"lbm_write_time_us":1980,"lbm_writes_lt_1ms":27,"rows_written":5,"thread_start_us":953,"threads_started":2,"wal-append.queue_time_us":819} > I0521 22:33:49.639096 4549 maintenance_manager.cc:326] P > c3cc85c33a5447b2aa520019fe162966: Scheduling > UndoDeltaBlockGCOp(): 396 bytes on disk > I0521 22:33:49.640486 4548 maintenance_manager.cc:525] P > c3cc85c33a5447b2aa520019fe162966: > UndoDeltaBlockGCOp() complete. Timing: real > 0.001suser 0.001s sys 0.000s Metrics: > {"cfile_init":1,"lbm_read_time_us":269,"lbm_reads_lt_1ms":4} > W0521 22:34:17.794472 4380 debug-util.cc:402] Leaking SignalData structure > 0x7b080002ade0 after lost signal to thread 4360 > W0521 22:34:17.795437 4380 debug-util.cc:402] Leaking SignalData structure > 0x7b080002a7e0 after lost signal to thread 4486 > W0521 22:34:20.286921 4545 debug-util.cc:402] Leaking SignalData structure > 0x7b080002b2e0 after lost signal to thread 4531 > W0521 22:34:20.287376 4545 debug-util.cc:402] Leaking SignalData structure > 0x7b080002b140 after lost signal to thread 4591 > W0521 22:35:27.726336 4380 debug-util.cc:402] Leaking SignalData structure > 0x7b080002af40 after lost signal to thread 4360 > W0521 22:35:27.727084 4380 debug-util.cc:402] Leaking SignalData structure > 0x7b080002a980 after lost signal to thread 4486 > W0521 22:36:12.250830 4545 debug-util.cc:402] Leaking SignalData structure > 0x7b080002b9c0 after lost signal to thread 4531 > W0521 22:36:12.25124
[jira] [Updated] (KUDU-3129) ToolTest.TestHmsList can timeout
[ https://issues.apache.org/jira/browse/KUDU-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3129: Attachment: kudu-tool-test.2.txt.xz > ToolTest.TestHmsList can timeout > > > Key: KUDU-3129 > URL: https://issues.apache.org/jira/browse/KUDU-3129 > Project: Kudu > Issue Type: Bug > Components: hms, test >Affects Versions: 1.12.0 >Reporter: Andrew Wong >Priority: Major > Attachments: kudu-tool-test.2.txt, kudu-tool-test.2.txt.xz -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KUDU-3145) KUDU_LINK should be set before function APPEND_LINKER_FLAGS is called
[ https://issues.apache.org/jira/browse/KUDU-3145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin resolved KUDU-3145. - Fix Version/s: 1.13.0 Resolution: Fixed > KUDU_LINK should be set before function APPEND_LINKER_FLAGS is called > - > > Key: KUDU-3145 > URL: https://issues.apache.org/jira/browse/KUDU-3145 > Project: Kudu > Issue Type: Sub-task >Reporter: zhaorenhai >Assignee: huangtianhua >Priority: Major > Fix For: 1.13.0 > > Time Spent: 20m > Remaining Estimate: 0h > > KUDU_LINK should be set before function APPEND_LINKER_FLAGS is called, > because function APPEND_LINKER_FLAGS contains the following logic, which > reads KUDU_LINK: > {code} > if ("${LINKER_FAMILY}" STREQUAL "gold") > if("${LINKER_VERSION}" VERSION_LESS "1.12" AND > "${KUDU_LINK}" STREQUAL "d") > message(WARNING "Skipping gold <1.12 with dynamic linking.") > continue() > endif() > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows
[ https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126243#comment-17126243 ] Alexey Serbin commented on KUDU-2727: - One more set of stack traces: {noformat} tids=[1324418] 0x7f61b79fc5e0 0x1ec35f4 base::internal::SpinLockDelay() 0x1ec347c base::SpinLock::SlowLock() 0xb4236d kudu::consensus::Peer::SendNextRequest() 0xb43771 _ZN5boost6detail8function26void_function_obj_invoker0IZN4kudu9consensus4Peer13SignalRequestEbEUlvE_vE6invokeERNS1_15function_bufferE 0x1eb1d1d kudu::FunctionRunnable::Run() 0x1eaedbf kudu::ThreadPool::DispatchThread() 0x1ea4a84 kudu::Thread::SuperviseThread() 0x7f61b79f4e25 start_thread 0x7f61b5cd234d __clone tids=[93293,93284,93285,93286,93287,93288,93289,93290,93291,93292,93304,93294,93295,93296,93297,93298,93299,93300,93301,93302,93303,93313,93322,93321,93320,93319,93318,93317,93316,93315,93314,93283,93312,93311,93310,93309,93308,93307,93306,93305] 0x7f61b79fc5e0 0x1ec35f4 base::internal::SpinLockDelay() 0x1ec347c base::SpinLock::SlowLock() 0xb7deb8 kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm() 0xaab010 kudu::tablet::TransactionDriver::ExecuteAsync() 0xaa344c kudu::tablet::TabletReplica::SubmitWrite() 0x928fb0 kudu::tserver::TabletServiceImpl::Write() 0x1d2e8d9 kudu::rpc::GeneratedServiceIf::Handle() 0x1d2efd9 kudu::rpc::ServicePool::RunThread() 0x1ea4a84 kudu::Thread::SuperviseThread() 0x7f61b79f4e25 start_thread 0x7f61b5cd234d __clone tids=[1324661] 0x7f61b79fc5e0 0x1ec35f4 base::internal::SpinLockDelay() 0x1ec347c base::SpinLock::SlowLock() 0xb7df8e kudu::consensus::RaftConsensus::Replicate() 0xaab8e7 kudu::tablet::TransactionDriver::Prepare() 0xaac009 kudu::tablet::TransactionDriver::PrepareTask() 0x1eaedbf kudu::ThreadPool::DispatchThread() 0x1ea4a84 kudu::Thread::SuperviseThread() 0x7f61b79f4e25 start_thread 0x7f61b5cd234d __clone tids=[93383] 0x7f61b79fc5e0 0x7f61b79f8cf2 __pthread_cond_timedwait 0x1dfcfa9 kudu::ConditionVariable::WaitUntil() 0xb73bc7 kudu::consensus::RaftConsensus::UpdateReplica() 0xb75128 kudu::consensus::RaftConsensus::Update() 0x92c5d1 kudu::tserver::ConsensusServiceImpl::UpdateConsensus() 0x1d2e8d9 kudu::rpc::GeneratedServiceIf::Handle() 0x1d2efd9 kudu::rpc::ServicePool::RunThread() 0x1ea4a84 kudu::Thread::SuperviseThread() 0x7f61b79f4e25 start_thread 0x7f61b5cd234d __clone {noformat} Thread {{93383}} holds the lock while waiting on a condition variable, blocking many other threads. 
> Contention on the Raft consensus lock can cause tablet service queue overflows > -- > > Key: KUDU-2727 > URL: https://issues.apache.org/jira/browse/KUDU-2727 > Project: Kudu > Issue Type: Improvement > Components: perf >Reporter: William Berkeley >Priority: Major -- This message was sent by Atlassian Jira (v8.3.4#803005)
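To illustrate the pattern behind the stacks above: a thread that blocks on a condition variable guarded by one mutex while still holding another mutex convoys every thread that needs the held mutex. A minimal sketch with hypothetical names (not Kudu's actual code):

{code:cpp}
// Hypothetical sketch of the convoy seen above: a thread holds the
// consensus lock (lock A) while blocking on a condition tied to lock B,
// so every Write RPC that needs lock A stalls until the wait finishes.
#include <condition_variable>
#include <mutex>

std::mutex consensus_lock;       // lock A: guards term/role/Raft state
std::mutex wal_mutex;            // lock B: guards the WAL flush signal
std::condition_variable wal_cv;
bool wal_flushed = false;

void UpdateReplicaAntiPattern() {
  std::lock_guard<std::mutex> raft_guard(consensus_lock);  // hold lock A...
  std::unique_lock<std::mutex> wal_lock(wal_mutex);
  // ...while waiting on lock B's condition. Callers of
  // CheckLeadershipAndBindTerm() now spin on consensus_lock
  // until the flush is signaled.
  wal_cv.wait(wal_lock, [] { return wal_flushed; });
}
{code}

The usual remedies are to release lock A before blocking, or to make the hot readers of lock A's state wait-free, as proposed above.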
[jira] [Comment Edited] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows
[ https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125466#comment-17125466 ] Alexey Serbin edited comment on KUDU-2727 at 6/4/20, 3:14 AM: -- Another set of stacks, just for more context (captured with code close to kudu 1.10.1): {noformat} tids=[1866940] 0x7fc8d67f95e0 0x1ec35f4 base::internal::SpinLockDelay() 0x1ec347c base::SpinLock::SlowLock() 0xb68518 kudu::consensus::RaftConsensus::NotifyCommitIndex() 0xb4c9e7 kudu::consensus::PeerMessageQueue::NotifyObserversTask() 0xb47850 _ZN4kudu8internal7InvokerILi2ENS0_9BindStateINS0_15RunnableAdapterIMNS_9consensus16PeerMessageQueueEFvRKSt8functionIFvPNS4_24PeerMessageQueueObserverEEFvPS5_SC_EFvNS0_17UnretainedWrapperIS5_EEZNS5_34NotifyObserversOfCommitIndexChangeElEUlS8_E_EEESH_E3RunEPNS0_13BindStateBaseE 0x1eaedbf kudu::ThreadPool::DispatchThread() 0x1ea4a84 kudu::Thread::SuperviseThread() 0x7fc8d67f1e25 start_thread 0x7fc8d4acf34d __clone tids=[1370336,1370326,1370327,1370328,1370329,1370330,1370331,1370332,1370333,1370334,1370335,1370323,1370337,1370338,1370339,1370340,1370341,1370342,1370343,1370344,1370345,1370346,1370353,1370361,1370360,1370359,1370358,1370357,1370356,1370355,1370354,1370325,1370352,1370351,1370350,1370349,1370348,1370347,1370322,1370324] 0x7fc8d67f95e0 0x1ec35f4 base::internal::SpinLockDelay() 0x1ec347c base::SpinLock::SlowLock() 0xb7deb8 kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm() 0xaab010 kudu::tablet::TransactionDriver::ExecuteAsync() 0xaa344c kudu::tablet::TabletReplica::SubmitWrite() 0x928fb0 kudu::tserver::TabletServiceImpl::Write() 0x1d2e8d9 kudu::rpc::GeneratedServiceIf::Handle() 0x1d2efd9 kudu::rpc::ServicePool::RunThread() 0x1ea4a84 kudu::Thread::SuperviseThread() 0x7fc8d67f1e25 start_thread 0x7fc8d4acf34d __clone tids=[1866932,1866929] 0x7fc8d67f95e0 0x7fc8d67f5943 __pthread_cond_wait 0xb99b38 kudu::log::Log::AsyncAppend() 0xb9c24c kudu::log::Log::AsyncAppendCommit() 0xaad489 kudu::tablet::TransactionDriver::ApplyTask() 0x1eaedbf kudu::ThreadPool::DispatchThread() 0x1ea4a84 kudu::Thread::SuperviseThread() 0x7fc8d67f1e25 start_thread 0x7fc8d4acf34d __clone tids=[1866928] 0x7fc8d67f95e0 0x7fc8d67f5943 __pthread_cond_wait 0xb99b38 kudu::log::Log::AsyncAppend() 0xb9c493 kudu::log::Log::AsyncAppendReplicates() 0xb597e9 kudu::consensus::LogCache::AppendOperations() 0xb4fa24 kudu::consensus::PeerMessageQueue::AppendOperations() 0xb4fd45 kudu::consensus::PeerMessageQueue::AppendOperation() 0xb6f28c kudu::consensus::RaftConsensus::AppendNewRoundToQueueUnlocked() 0xb7dff8 kudu::consensus::RaftConsensus::Replicate() 0xaab8e7 kudu::tablet::TransactionDriver::Prepare() 0xaac009 kudu::tablet::TransactionDriver::PrepareTask() 0x1eaedbf kudu::ThreadPool::DispatchThread() 0x1ea4a84 kudu::Thread::SuperviseThread() {noformat} In the stacks above, thread {{1866928}} is holding a lock taken in {{RaftConsensus::Replicate()}} while waiting on a condition variable in {{Log::AsyncAppend()}}, calling {{entry_batch_queue_.BlockingPut(entry_batch.get())}}. 
was (Author: aserbin): Another set of stacks, just for more context (captured with code close to kudu 1.10.1): {noformat} tids=[1866940] 0x7fc8d67f95e0 0x1ec35f4 base::internal::SpinLockDelay() 0x1ec347c base::SpinLock::SlowLock() 0xb68518 kudu::consensus::RaftConsensus::NotifyCommitIndex() 0xb4c9e7 kudu::consensus::PeerMessageQueue::NotifyObserversTask() 0xb47850 _ZN4kudu8internal7InvokerILi2ENS0_9BindStateINS0_15RunnableAdapterIMNS_9consensus16PeerMessageQueueEFvRKSt8functionIFvPNS4_24PeerMessageQueueObserverEEFvPS5_SC_EFvNS0_17UnretainedWrapperIS5_EEZNS5_34NotifyObserversOfCommitIndexChangeElEUlS8_E_EEESH_E3RunEPNS0_13BindStateBaseE 0x1eaedbf kudu::ThreadPool::DispatchThread() 0x1ea4a84 kudu::Thread::SuperviseThread() 0x7fc8d67f1e25 start_thread 0x7fc8d4acf34d __clone tids=[1370336,1370326,1370327,1370328,1370329,1370330,1370331,1370332,1370333,1370334,1370335,1370323,1370337,1370338,1370339,1370340,1370341,1370342,1370343,1370344,1370345,1370346,1370353,1370361,1370360,1370359,1370358,1370357,1370356,1370355,1370354,1370325,1370352,1370351,1370350,1370349,1370348,1370347,1370322,1370324] 0x7fc8d67f95e0 0x1ec35f4 base::internal::SpinLockDelay() 0x1ec347c base::SpinLock::SlowLock() 0xb7deb8 kudu::consensus::RaftCo
[jira] [Assigned] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows
[ https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin reassigned KUDU-2727: --- Assignee: (was: Mike Percy) > Contention on the Raft consensus lock can cause tablet service queue overflows > -- > > Key: KUDU-2727 > URL: https://issues.apache.org/jira/browse/KUDU-2727 > Project: Kudu > Issue Type: Improvement > Components: perf >Reporter: William Berkeley >Priority: Major -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows
[ https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125466#comment-17125466 ] Alexey Serbin commented on KUDU-2727: - Another set of stacks, just for more context (captured with code close to kudu 1.10.1): {noformat} tids=[1866940] 0x7fc8d67f95e0 0x1ec35f4 base::internal::SpinLockDelay() 0x1ec347c base::SpinLock::SlowLock() 0xb68518 kudu::consensus::RaftConsensus::NotifyCommitIndex() 0xb4c9e7 kudu::consensus::PeerMessageQueue::NotifyObserversTask() 0xb47850 _ZN4kudu8internal7InvokerILi2ENS0_9BindStateINS0_15RunnableAdapterIMNS_9consensus16PeerMessageQueueEFvRKSt8functionIFvPNS4_24PeerMessageQueueObserverEEFvPS5_SC_EFvNS0_17UnretainedWrapperIS5_EEZNS5_34NotifyObserversOfCommitIndexChangeElEUlS8_E_EEESH_E3RunEPNS0_13BindStateBaseE 0x1eaedbf kudu::ThreadPool::DispatchThread() 0x1ea4a84 kudu::Thread::SuperviseThread() 0x7fc8d67f1e25 start_thread 0x7fc8d4acf34d __clone tids=[1370336,1370326,1370327,1370328,1370329,1370330,1370331,1370332,1370333,1370334,1370335,1370323,1370337,1370338,1370339,1370340,1370341,1370342,1370343,1370344,1370345,1370346,1370353,1370361,1370360,1370359,1370358,1370357,1370356,1370355,1370354,1370325,1370352,1370351,1370350,1370349,1370348,1370347,1370322,1370324] 0x7fc8d67f95e0 0x1ec35f4 base::internal::SpinLockDelay() 0x1ec347c base::SpinLock::SlowLock() 0xb7deb8 kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm() 0xaab010 kudu::tablet::TransactionDriver::ExecuteAsync() 0xaa344c kudu::tablet::TabletReplica::SubmitWrite() 0x928fb0 kudu::tserver::TabletServiceImpl::Write() 0x1d2e8d9 kudu::rpc::GeneratedServiceIf::Handle() 0x1d2efd9 kudu::rpc::ServicePool::RunThread() 0x1ea4a84 kudu::Thread::SuperviseThread() 0x7fc8d67f1e25 start_thread 0x7fc8d4acf34d __clone tids=[1866932,1866929] 0x7fc8d67f95e0 0x7fc8d67f5943 __pthread_cond_wait 0xb99b38 kudu::log::Log::AsyncAppend() 0xb9c24c kudu::log::Log::AsyncAppendCommit() 0xaad489 kudu::tablet::TransactionDriver::ApplyTask() 0x1eaedbf kudu::ThreadPool::DispatchThread() 0x1ea4a84 kudu::Thread::SuperviseThread() 0x7fc8d67f1e25 start_thread 0x7fc8d4acf34d __clone tids=[1866928] 0x7fc8d67f95e0 0x7fc8d67f5943 __pthread_cond_wait 0xb99b38 kudu::log::Log::AsyncAppend() 0xb9c493 kudu::log::Log::AsyncAppendReplicates() 0xb597e9 kudu::consensus::LogCache::AppendOperations() 0xb4fa24 kudu::consensus::PeerMessageQueue::AppendOperations() 0xb4fd45 kudu::consensus::PeerMessageQueue::AppendOperation() 0xb6f28c kudu::consensus::RaftConsensus::AppendNewRoundToQueueUnlocked() 0xb7dff8 kudu::consensus::RaftConsensus::Replicate() 0xaab8e7 kudu::tablet::TransactionDriver::Prepare() 0xaac009 kudu::tablet::TransactionDriver::PrepareTask() 0x1eaedbf kudu::ThreadPool::DispatchThread() 0x1ea4a84 kudu::Thread::SuperviseThread() {noformat} > Contention on the Raft consensus lock can cause tablet service queue overflows > -- > > Key: KUDU-2727 > URL: https://issues.apache.org/jira/browse/KUDU-2727 > Project: Kudu > Issue Type: Improvement > Components: perf >Reporter: William Berkeley >Assignee: Mike Percy >Priority: Major > > Here's stacks illustrating the phenomenon: > {noformat} > tids=[2201] > 0x379ba0f710 >0x1fb951a base::internal::SpinLockDelay() >0x1fb93b7 base::SpinLock::SlowLock() > 0xb4e68e kudu::consensus::Peer::SignalRequest() > 0xb9c0df kudu::consensus::PeerManager::SignalRequest() > 0xb8c178 kudu::consensus::RaftConsensus::Replicate() > 0xaab816 kudu::tablet::TransactionDriver::Prepare() > 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask() 
>0x1fa37ed kudu::ThreadPool::DispatchThread() >0x1f9c2a1 kudu::Thread::SuperviseThread() > 0x379ba079d1 start_thread > 0x379b6e88fd clone > tids=[4515] > 0x379ba0f710 >0x1fb951a base::internal::SpinLockDelay() >0x1fb93b7 base::SpinLock::SlowLock() > 0xb74c60 kudu::consensus::RaftConsensus::NotifyCommitIndex() > 0xb59307 kudu::consensus::PeerMessageQueue::NotifyObserversTask() > 0xb54058 > _ZN4kudu8internal7InvokerILi2ENS0_9BindStateINS0_15RunnableAdapterIMNS_9consensus16PeerMessageQueueEFvRKSt8functionIFvPNS4_24PeerMessageQueueObserverEEFvPS5_SC
[jira] [Updated] (KUDU-2781) Hardening for location awareness command-line flag
[ https://issues.apache.org/jira/browse/KUDU-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-2781: Component/s: master > Hardening for location awareness command-line flag > -- > > Key: KUDU-2781 > URL: https://issues.apache.org/jira/browse/KUDU-2781 > Project: Kudu > Issue Type: Improvement > Components: master >Reporter: Alexey Serbin >Assignee: Alexey Serbin >Priority: Major > > Add a few verification steps related to location assignment: > * the location assignment executable is present and executable > * the location assignment executable conforms to the expected interface: it > accepts one argument (an IP address or DNS name) and outputs the assigned > location to stdout > * the same DNS name/IP address is assigned the same location > * the resulting location output to stdout conforms to the format for > locations in Kudu > It's possible to implement these in {{kudu-master}} using group flag > validators: see the {{GROUP_FLAG_VALIDATOR}} macro. > Performing the verification steps mentioned above should help avoid > situations where Kudu tablet servers cannot register with the Kudu master if > the location assignment executable path is misspelled or the executable > does not behave as expected. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-2781) Hardening for location awareness command-line flag
[ https://issues.apache.org/jira/browse/KUDU-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-2781: Description: Add a few verification steps related to location assignment: * the location assignment executable is present and executable * the location assignment executable conforms to the expected interface: it accepts one argument (an IP address or DNS name) and outputs the assigned location to stdout * the same DNS name/IP address is assigned the same location * the resulting location output to stdout conforms to the format for locations in Kudu It's possible to implement these in {{kudu-master}} using group flag validators: see the {{GROUP_FLAG_VALIDATOR}} macro. Performing the verification steps mentioned above should help avoid situations where Kudu tablet servers cannot register with the Kudu master if the location assignment executable path is misspelled or the executable does not behave as expected. > Hardening for location awareness command-line flag > -- > > Key: KUDU-2781 > URL: https://issues.apache.org/jira/browse/KUDU-2781 > Project: Kudu > Issue Type: Improvement >Reporter: Alexey Serbin >Assignee: Alexey Serbin >Priority: Major -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-2781) Hardening for location awareness command-line flag
[ https://issues.apache.org/jira/browse/KUDU-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-2781: Labels: observability supportability (was: ) > Hardening for location awareness command-line flag > -- > > Key: KUDU-2781 > URL: https://issues.apache.org/jira/browse/KUDU-2781 > Project: Kudu > Issue Type: Improvement > Components: master >Reporter: Alexey Serbin >Assignee: Alexey Serbin >Priority: Major > Labels: observability, supportability -- This message was sent by Atlassian Jira (v8.3.4#803005)
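As a sketch of what such hardening might look like: a startup-time group flag validator for the location-assignment command. The flag name, include path, and validator body below are assumptions for illustration, not Kudu's actual implementation:

{code:cpp}
// Hypothetical validator sketch for the location-assignment flag; the
// flag name and validator contents are assumptions for illustration.
#include <unistd.h>

#include <gflags/gflags.h>

#include "kudu/util/flag_validators.h"  // GROUP_FLAG_VALIDATOR

DECLARE_string(location_mapping_cmd);

namespace {

// Returns false (failing master startup) if a location mapping command is
// set but does not point at an existing executable file.
bool ValidateLocationMappingCmd() {
  if (FLAGS_location_mapping_cmd.empty()) {
    return true;  // location awareness disabled: nothing to verify
  }
  // Assumes the flag holds a bare path (real code would split off any
  // arguments first). A fuller validator would also run the command on a
  // sample host and verify the output matches Kudu's location format
  // (a '/'-prefixed path), and that repeated runs give the same location.
  return access(FLAGS_location_mapping_cmd.c_str(), X_OK) == 0;
}

}  // anonymous namespace

GROUP_FLAG_VALIDATOR(validate_location_mapping_cmd, ValidateLocationMappingCmd);
{code}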
[jira] [Comment Edited] (KUDU-2169) Allow replicas that do not exist to vote
[ https://issues.apache.org/jira/browse/KUDU-2169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124314#comment-17124314 ] Alexey Serbin edited comment on KUDU-2169 at 6/2/20, 9:08 PM: -- Now we have the 3-4-3 replica management scheme, and we don't use the 3-2-3 scheme anymore. With the 3-4-3 scheme there are scenarios where the system first evicts a replica, and then adds a new non-voter replica: that's when the replica to be evicted falls behind the WAL segment GC threshold or experiences a disk failure. In very rare cases it might happen that a tablet ends up with leader replica A not being able to replicate/commit the change in the Raft configuration, as described. On the other hand, such a newly added replica D under the 3-4-3 scheme is a non-voter, and it cannot vote by definition. In other words, some manual intervention would be necessary in the described scenario, but not in the way this JIRA proposes to implement it. Closing as 'Won't Do'. was (Author: aserbin): Now we have the 3-4-3 replica management scheme, and we don't use 3-2-3 scheme anymore. With the 3-4-3 scheme there are scenarios where the system first evicts a replica, and then adds a new non-voter replica: that's when the replica to be evicted fails behind WAL segment GC threshold or experience a disk failure. In very rare cases it might happen that a tablet ends up with leader replica A, and replica A cannot replicate/commit the change in the Raft configuration as described. From the other side, such a newly replica D in case of the 3-4-3 scheme is a non-voter, and it cannot vote by definition. In other words, some manual intervention would be necessary in the described scenario, but not the way how this JIRA proposes. Closing as 'Won't Do'. > Allow replicas that do not exist to vote > > > Key: KUDU-2169 > URL: https://issues.apache.org/jira/browse/KUDU-2169 > Project: Kudu > Issue Type: Sub-task > Components: consensus >Reporter: Mike Percy >Priority: Major > Fix For: n/a > > > In certain scenarios it is desirable for replicas that do not exist on a > tablet server to be able to vote. After the implementation of KUDU-871, > tombstoned tablets are now able to vote. However, there are circumstances (at > least in a pre- KUDU-1097 world) where voters that do not have a copy of a > replica (running or tombstoned) would be needed to vote to ensure > availability in certain edge-case failure scenarios. > The quick justification for why it would be safe for a non-existent replica > to vote is that it would be equivalent to a replica that has simply not yet > replicated any WAL entries, in which case it would be legal to vote for any > candidate. Of course, a candidate would only ask such a replica to vote for > it if it believed that replica to be a voter in its config. > Some additional discussion can be found here: > https://github.com/apache/kudu/blob/master/docs/design-docs/raft-tablet-copy.md#should-a-server-be-allowed-to-vote-if-it-does_not_exist-or-is-deleted > What follows is an example of a scenario where "non-existent" replicas being > able to vote would be desired: > In a 3-2-3 re-replication paradigm, the leader (A) of a 3-replica config \{A, > B, C\} evicts one replica (C). Then, the leader (A) adds a new voter (D). > Before A is able to replicate this config change to B or D, A is partitioned > from a network perspective. However A writes this config change to its local > WAL. 
After this, the entire cluster is brought down, the network is restored, > and the entire cluster is restarted. However, B fails to come back online due > to a hardware failure. > The only way to automatically recover in this scenario is to allow D, which > has no concept of the tablet being discussed, to vote for A to become leader, > which will then tablet copy to D and make the tablet available for writes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KUDU-2169) Allow replicas that do not exist to vote
[ https://issues.apache.org/jira/browse/KUDU-2169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin resolved KUDU-2169. - Fix Version/s: n/a Resolution: Won't Do Now we have the 3-4-3 replica management scheme, and we don't use the 3-2-3 scheme anymore. With the 3-4-3 scheme there are scenarios where the system first evicts a replica, and then adds a new non-voter replica: that's when the replica to be evicted falls behind the WAL segment GC threshold or experiences a disk failure. In very rare cases it might happen that a tablet ends up with leader replica A, and replica A cannot replicate/commit the change in the Raft configuration as described. On the other hand, such a newly added replica D in the case of the 3-4-3 scheme is a non-voter, and it cannot vote by definition. In other words, some manual intervention would be necessary in the described scenario, but not in the way this JIRA proposes. Closing as 'Won't Do'. > Allow replicas that do not exist to vote > > > Key: KUDU-2169 > URL: https://issues.apache.org/jira/browse/KUDU-2169 > Project: Kudu > Issue Type: Sub-task > Components: consensus >Reporter: Mike Percy >Priority: Major > Fix For: n/a > > > In certain scenarios it is desirable for replicas that do not exist on a > tablet server to be able to vote. After the implementation of KUDU-871, > tombstoned tablets are now able to vote. However, there are circumstances (at > least in a pre-KUDU-1097 world) where voters that do not have a copy of a > replica (running or tombstoned) would be needed to vote to ensure > availability in certain edge-case failure scenarios. > The quick justification for why it would be safe for a non-existent replica > to vote is that it would be equivalent to a replica that has simply not yet > replicated any WAL entries, in which case it would be legal to vote for any > candidate. Of course, a candidate would only ask such a replica to vote for > it if it believed that replica to be a voter in its config. > Some additional discussion can be found here: > https://github.com/apache/kudu/blob/master/docs/design-docs/raft-tablet-copy.md#should-a-server-be-allowed-to-vote-if-it-does_not_exist-or-is-deleted > What follows is an example of a scenario where "non-existent" replicas being > able to vote would be desired: > In a 3-2-3 re-replication paradigm, the leader (A) of a 3-replica config \{A, > B, C\} evicts one replica (C). Then, the leader (A) adds a new voter (D). > Before A is able to replicate this config change to B or D, A is partitioned > from a network perspective. However A writes this config change to its local > WAL. After this, the entire cluster is brought down, the network is restored, > and the entire cluster is restarted. However, B fails to come back online due > to a hardware failure. > The only way to automatically recover in this scenario is to allow D, which > has no concept of the tablet being discussed, to vote for A to become leader, > which will then tablet copy to D and make the tablet available for writes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KUDU-1621) Flush lingering operations upon destruction of an AUTO_FLUSH_BACKGROUND session
[ https://issues.apache.org/jira/browse/KUDU-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin resolved KUDU-1621. - Fix Version/s: n/a Resolution: Won't Fix Automatically flushing data in the {{KuduSession}} might block, indeed. It seems the current approach of issuing a warning when data is not flushed is good enough: it's uniform across all flush modes and avoids the application hanging on close (a sketch of the explicit-flush pattern follows below). > Flush lingering operations upon destruction of an AUTO_FLUSH_BACKGROUND > session > --- > > Key: KUDU-1621 > URL: https://issues.apache.org/jira/browse/KUDU-1621 > Project: Kudu > Issue Type: Improvement > Components: client >Affects Versions: 1.0.0 >Reporter: Alexey Serbin >Priority: Major > Fix For: n/a > > > In the current implementation of the AUTO_FLUSH_BACKGROUND mode, it's necessary to > call KuduSession::Flush() or KuduSession::FlushAsync() explicitly before > destroying/abandoning a session if it's desired to have any pending > operations flushed. > As [~adar] noticed during the review of https://gerrit.cloudera.org/#/c/4432/ , > it might make sense to change this behavior to automatically flush any > pending operations upon closing a Kudu AUTO_FLUSH_BACKGROUND session. That > would be more consistent with the semantics of the AUTO_FLUSH_BACKGROUND mode > and more user-friendly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
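For reference, a minimal sketch of the explicit-flush pattern with the C++ client. The table and column names are made up for the example, and the {{RETURN_NOT_OK}} helper is defined inline so the snippet stays self-contained:
{code}
#include <cstdint>
#include <memory>

#include "kudu/client/client.h"

using kudu::client::KuduClient;
using kudu::client::KuduInsert;
using kudu::client::KuduSession;
using kudu::client::KuduTable;
using kudu::Status;

#define RETURN_NOT_OK(expr) do { \
    const Status _s = (expr); \
    if (!_s.ok()) return _s; \
  } while (0)

Status WriteRows(const kudu::client::sp::shared_ptr<KuduClient>& client) {
  kudu::client::sp::shared_ptr<KuduTable> table;
  RETURN_NOT_OK(client->OpenTable("my_table", &table));  // hypothetical table

  kudu::client::sp::shared_ptr<KuduSession> session = client->NewSession();
  RETURN_NOT_OK(session->SetFlushMode(KuduSession::AUTO_FLUSH_BACKGROUND));

  for (int32_t i = 0; i < 100; ++i) {
    std::unique_ptr<KuduInsert> insert(table->NewInsert());
    RETURN_NOT_OK(insert->mutable_row()->SetInt32("key", i));
    RETURN_NOT_OK(session->Apply(insert.release()));  // Apply() takes ownership
  }
  // The session destructor does NOT flush pending operations -- it only
  // warns about them -- so flush explicitly before the session goes away.
  return session->Flush();
}
{code}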
[jira] [Commented] (KUDU-3131) test rw_mutex-test hangs sometimes if build_type is release
[ https://issues.apache.org/jira/browse/KUDU-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17120608#comment-17120608 ] Alexey Serbin commented on KUDU-3131: - Hi [~RenhaiZhao], on the server where I do a lot of compilation/testing, the version of glibc is {{2.12-1.149.el6_6.9}}. It's a really old installation: CentOS 6.6 > test rw_mutex-test hangs sometimes if build_type is release > --- > > Key: KUDU-3131 > URL: https://issues.apache.org/jira/browse/KUDU-3131 > Project: Kudu > Issue Type: Sub-task >Reporter: huangtianhua >Priority: Major > > Built and tested kudu on aarch64; in release mode there is a test that hangs > sometimes (maybe a deadlock?); the console output is as follows: > [==] Running 2 tests from 1 test case. > [--] Global test environment set-up. > [--] 2 tests from Priorities/RWMutexTest > [ RUN ] Priorities/RWMutexTest.TestDeadlocks/0 > And it seems to be OK in debug mode. > Now only this one test fails sometimes on aarch64; [~aserbin] [~adar], would > you please have a look at this? Or give some suggestions to us, thanks very > much. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KUDU-368) Run local benchmarks under perf-stat
[ https://issues.apache.org/jira/browse/KUDU-368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin reassigned KUDU-368: -- Assignee: Alexey Serbin > Run local benchmarks under perf-stat > > > Key: KUDU-368 > URL: https://issues.apache.org/jira/browse/KUDU-368 > Project: Kudu > Issue Type: Improvement > Components: test >Affects Versions: M4.5 >Reporter: Todd Lipcon >Assignee: Alexey Serbin >Priority: Minor > Labels: benchmarks, perf > > It would be nice to run a lot of our nightly benchmarks under perf-stat so we > can see, on a regression, which factors changed (e.g. instruction count, cycles, > stalled cycles, cache misses, etc.); a sample invocation is sketched below. -- This message was sent by Atlassian Jira (v8.3.4#803005)
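For reference, a sketch of what such a run might look like; the binary name, gtest filter, and event list are just an example of the counters mentioned above, not an existing nightly job:
{noformat}
# Collect regression-triage counters for a single benchmark binary.
perf stat -e instructions,cycles,stalled-cycles-frontend,cache-misses \
    ./build/latest/bin/wire_protocol-test --gtest_filter='*Benchmark*'
{noformat}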
[jira] [Commented] (KUDU-2604) Add label for tserver
[ https://issues.apache.org/jira/browse/KUDU-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118788#comment-17118788 ] Alexey Serbin commented on KUDU-2604: - [~granthenke], yes, I think the remaining functionality can be broken down into smaller JIRA items. At a higher level, I see the following pieces: * Define and assign tags to tablet servers. * Update master's placement policies to take tags into account when adding/distributing replicas of tablets. * Add support to the C++ and Java clients: clients can specify a set of tags when creating tables. * The {{kudu cluster rebalance}} tool and the auto-rebalancer honor the tags when rebalancing the corresponding tables (a usage sketch of the existing tool follows below). The tool is also able to report on tablet replicas which are placed in a non-conforming way w.r.t. the tags specified for their tables (those non-conformantly placed replicas might appear during automatic re-replication: this is similar to what we have with the current placement policies). * The {{kudu cluster ksck}} CLI tool provides information on tags for tablet servers. We can create sub-tasks for this if we decide to implement it. > Add label for tserver > - > > Key: KUDU-2604 > URL: https://issues.apache.org/jira/browse/KUDU-2604 > Project: Kudu > Issue Type: New Feature >Reporter: Hong Shen >Priority: Major > Labels: location-awareness, rack-awareness > Fix For: n/a > > Attachments: image-2018-10-15-21-52-21-426.png > > > As the cluster gets bigger and bigger, a big table with a lot of tablets will > be distributed across almost all the tservers; when a client writes a batch to the big > table, it may cache connections to lots of tservers, and scalability may be > constrained. > If the tablets in one table or partition live on only a part of the tservers, a client > will only have to cache connections to that part's tservers. So we propose to > add labels to tservers, with each tserver belonging to a unique label. A client > specifies a label when creating a table or adding a partition; the tablets will only be > created on the tservers with the specified label, and if none is specified, a default > label will be used. > It will also be beneficial for: > 1 Tservers across data centers. > 2 Heterogeneous tservers, with different disk, CPU or memory. > 3 Physical isolation, especially of IO: isolating some tables from others. > 4 Gated launch: upgrading tservers one label at a time. > In our production cluster, we have encountered the above issues, and they need to be > resolved. -- This message was sent by Atlassian Jira (v8.3.4#803005)
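For context, the rebalancer is already driven per-table today; a tag-aware mode would presumably extend an invocation like the one below (master addresses and the table name are placeholders):
{noformat}
# Rebalance replicas of a single table across the cluster.
kudu cluster rebalance master-1:7051,master-2:7051,master-3:7051 \
    --tables=my_table
{noformat}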
[jira] [Reopened] (KUDU-2604) Add label for tserver
[ https://issues.apache.org/jira/browse/KUDU-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin reopened KUDU-2604: - It seems this JIRA item contains some useful ideas and details which are orthogonal to the current implementation of the rack awareness feature. If implemented, they might complement the overall functionality of the placement policies in Kudu. I'm removing the 'Duplicate of KUDU-1535' resolution. > Add label for tserver > - > > Key: KUDU-2604 > URL: https://issues.apache.org/jira/browse/KUDU-2604 > Project: Kudu > Issue Type: New Feature >Reporter: Hong Shen >Priority: Major > Labels: location-awareness, rack-awareness > Fix For: n/a > > Attachments: image-2018-10-15-21-52-21-426.png > > > As the cluster gets bigger and bigger, a big table with a lot of tablets will > be distributed across almost all the tservers; when a client writes a batch to the big > table, it may cache connections to lots of tservers, and scalability may be > constrained. > If the tablets in one table or partition live on only a part of the tservers, a client > will only have to cache connections to that part's tservers. So we propose to > add labels to tservers, with each tserver belonging to a unique label. A client > specifies a label when creating a table or adding a partition; the tablets will only be > created on the tservers with the specified label, and if none is specified, a default > label will be used. > It will also be beneficial for: > 1 Tservers across data centers. > 2 Heterogeneous tservers, with different disk, CPU or memory. > 3 Physical isolation, especially of IO: isolating some tables from others. > 4 Gated launch: upgrading tservers one label at a time. > In our production cluster, we have encountered the above issues, and they need to be > resolved. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-1865) Create fast path for RespondSuccess() in KRPC
[ https://issues.apache.org/jira/browse/KUDU-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117387#comment-17117387 ] Alexey Serbin commented on KUDU-1865: - Some more stacks captured from the diagnostic logs of a {{kudu-master}} process (Kudu 1.10): {noformat} Stacks at 0516 18:53:00.042003 (service queue overflowed for kudu.master.MasterService): tids=[736230] 0x7f803a76a5e0 0xb6219e tcmalloc::ThreadCache::ReleaseToCentralCache() 0xb62530 tcmalloc::ThreadCache::Scavenge() 0xad8a27 kudu::master::CatalogManager::ScopedLeaderSharedLock::ScopedLeaderSharedLock() 0xaa3a31 kudu::master::MasterServiceImpl::GetTableSchema() 0x221aaa9 kudu::rpc::GeneratedServiceIf::Handle() 0x221b1a9 kudu::rpc::ServicePool::RunThread() 0x23a8f84 kudu::Thread::SuperviseThread() 0x7f803a762e25 start_thread 0x7f8038a4134d __clone tids=[736248,736245,736243,736242] 0x7f803a76a5e0 0x23c5b44 base::internal::SpinLockDelay() 0x23c59cc base::SpinLock::SlowLock() 0xac5814 kudu::master::CatalogManager::CheckOnline() 0xae5032 kudu::master::CatalogManager::GetTableSchema() 0xaa3a85 kudu::master::MasterServiceImpl::GetTableSchema() 0x221aaa9 kudu::rpc::GeneratedServiceIf::Handle() 0x221b1a9 kudu::rpc::ServicePool::RunThread() 0x23a8f84 kudu::Thread::SuperviseThread() 0x7f803a762e25 start_thread 0x7f8038a4134d __clone tids=[736239,736229,736232,736233,736234,736235,736236,736237,736238,736240,736241,736244,736247] 0x7f803a76a5e0 0x23c5b44 base::internal::SpinLockDelay() 0x23c59cc base::SpinLock::SlowLock() 0xac5814 kudu::master::CatalogManager::CheckOnline() 0xaf102f kudu::master::CatalogManager::GetTableLocations() 0xaa36f8 kudu::master::MasterServiceImpl::GetTableLocations() 0x221aaa9 kudu::rpc::GeneratedServiceIf::Handle() 0x221b1a9 kudu::rpc::ServicePool::RunThread() 0x23a8f84 kudu::Thread::SuperviseThread() 0x7f803a762e25 start_thread 0x7f8038a4134d __clone tids=[736246,736231] 0x7f803a76a5e0 0x23c5b44 base::internal::SpinLockDelay() 0x23c59cc base::SpinLock::SlowLock() 0xad8b7c kudu::master::CatalogManager::ScopedLeaderSharedLock::ScopedLeaderSharedLock() 0xaa369d kudu::master::MasterServiceImpl::GetTableLocations() 0x221aaa9 kudu::rpc::GeneratedServiceIf::Handle() 0x221b1a9 kudu::rpc::ServicePool::RunThread() 0x23a8f84 kudu::Thread::SuperviseThread() 0x7f803a762e25 start_thread 0x7f8038a4134d __clone {noformat} > Create fast path for RespondSuccess() in KRPC > - > > Key: KUDU-1865 > URL: https://issues.apache.org/jira/browse/KUDU-1865 > Project: Kudu > Issue Type: Improvement > Components: rpc >Reporter: Sailesh Mukil >Priority: Major > Labels: perfomance, rpc > Attachments: alloc-pattern.py, cross-thread.txt > > > A lot of RPCs just respond with RespondSuccess(), which returns the exact > payload every time. This takes the same path as any other response by > ultimately calling Connection::QueueResponseForCall(), which has a few small > allocations. These small allocations (and their corresponding deallocations) > happen quite frequently (once for every IncomingCall) and end up taking > quite some time in the kernel (traversing the free list, spin locks, etc.). > This was found when [~mmokhtar] ran some profiles on Impala over KRPC on a 20 > node cluster and found the following: > The exact % of time spent is hard to quantify from the profiles, but these > were among the top 5 slowest stacks: > {code:java} > impalad ! tcmalloc::CentralFreeList::ReleaseToSpans - [unknown source file] > impalad ! 
tcmalloc::CentralFreeList::ReleaseListToSpans + 0x1a - [unknown > source file] > impalad ! tcmalloc::CentralFreeList::InsertRange + 0x3b - [unknown source > file] > impalad ! tcmalloc::ThreadCache::ReleaseToCentralCache + 0x103 - [unknown > source file] > impalad ! tcmalloc::ThreadCache::Scavenge + 0x3e - [unknown source file] > impalad ! operator delete + 0x329 - [unknown source file] > impalad ! __gnu_cxx::new_allocator::deallocate + 0x4 - > new_allocator.h:110 > impalad ! std::_Vector_base std::allocator>::_M_deallocate + 0x5 - stl_vector.h:178 > impalad ! ~_Vector_base + 0x4 - stl_vector.h:160 > impalad ! ~vector - stl_vector.h:425'slices' vector > impalad ! kudu::rpc::Connection::QueueResponseForCall + 0xac - > connection.cc:433 > impalad ! kudu::rpc::InboundCall::Respond + 0xfa - inbound_call.cc:133 > impalad ! kud
[jira] [Commented] (KUDU-3131) test rw_mutex-test hangs sometimes if build_type is release
[ https://issues.apache.org/jira/browse/KUDU-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117008#comment-17117008 ] Alexey Serbin commented on KUDU-3131: - I cannot reproduce this on the x86_64 architecture, and I don't have access to aarch64 at this point. I'd try attaching to the hung process with a debugger to see what's going on (a sketch follows below). [~huangtianhua], did you have a chance to try that? > test rw_mutex-test hangs sometimes if build_type is release > --- > > Key: KUDU-3131 > URL: https://issues.apache.org/jira/browse/KUDU-3131 > Project: Kudu > Issue Type: Sub-task >Reporter: huangtianhua >Priority: Major > > Built and tested kudu on aarch64; in release mode there is a test that hangs > sometimes (maybe a deadlock?); the console output is as follows: > [==] Running 2 tests from 1 test case. > [--] Global test environment set-up. > [--] 2 tests from Priorities/RWMutexTest > [ RUN ] Priorities/RWMutexTest.TestDeadlocks/0 > And it seems to be OK in debug mode. > Now only this one test fails sometimes on aarch64; [~aserbin] [~adar], would > you please have a look at this? Or give some suggestions to us, thanks very > much. -- This message was sent by Atlassian Jira (v8.3.4#803005)
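For anyone hitting this, a minimal sketch of that debugging approach (the PID is a placeholder):
{noformat}
# Attach to the hung test process and dump the stacks of all threads.
gdb -p <pid-of-rw_mutex-test>
(gdb) thread apply all bt
(gdb) detach
(gdb) quit
{noformat}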
[jira] [Resolved] (KUDU-3107) TestRpc.TestCancellationMultiThreads fail on ARM sometimes due to service queue is full
[ https://issues.apache.org/jira/browse/KUDU-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin resolved KUDU-3107. - Fix Version/s: NA Resolution: Cannot Reproduce > TestRpc.TestCancellationMultiThreads fail on ARM sometimes due to service > queue is full > --- > > Key: KUDU-3107 > URL: https://issues.apache.org/jira/browse/KUDU-3107 > Project: Kudu > Issue Type: Sub-task >Reporter: liusheng >Priority: Major > Fix For: NA > > Attachments: rpc-test.txt > > > The test TestRpc.TestCancellationMultiThreads fails sometimes on an ARM machine > due to the "service queue full" error; related error message: > {code:java} > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 318) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 319) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 320) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 321) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 324) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 332) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 334) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 335) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 336) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 337) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 338) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 339) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 340) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 341) > F0416 13:01:38.616358 31937 rpc-test.cc:1471] Check failed: > controller.status().IsAborted() || controller.status().IsServiceUnavailable() > || controller.status().ok() Remote error: Service unavailable: PushStrings > request on kudu.rpc.GenericCalculatorService from 127.0.0.1:41516 dropped due > to backpressure. The service queue is full; it has 100 items. 
> *** Check failure stack trace: *** > PC: @0x0 (unknown) > *** SIGABRT (@0x3e86bbf) received by PID 27583 (TID 0x84b1f050) from > PID 27583; stack trace: *** > @ 0x93cf0464 raise at ??:0 > @ 0x93cf18b4 abort at ??:0 > @ 0x942c5fdc google::logging_fail() at ??:0 > @ 0x942c7d40 google::LogMessage::Fail() at ??:0 > @ 0x942c9c78 google::LogMessage::SendToLog() at ??:0 > @ 0x942c7874 google::LogMessage::Flush() at ??:0 > @ 0x942ca4fc google::LogMessageFatal::~LogMessageFatal() at ??:0 > @ 0xdcee4940 kudu::rpc::SendAndCancelRpcs() at ??:0 > @ 0xdcee4b98 > _ZZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvENKUlvE_clEv > at ??:0 > @ 0xdcee76bc > _ZSt13__invoke_implIvZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_JEET_St14__invoke_otherOT0_DpOT1_ > at ??:0 > @ 0xdcee7484 > _ZSt8__invokeIZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS5_DpOS6_ > at ??:0 > @ 0xdcee8208 > _ZNSt6thread8_InvokerISt5tupleIJZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_EEE9_M_invokeIJLm0DTcl8__invokespcl10_S_declvalIXT_ESt12_Index_tupleIJXspT_EEE > at ??:0 > @ 0xdcee8168 > _ZNSt6thread8_InvokerISt5tupleIJZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_EEEclEv > at ??:0 > @ 0xdcee8110 > _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_E6_M_runEv > at ??:0 > @ 0x93f22e94 (unknown) at ??:0 > @ 0x93e1e088 start_thread at ??:0 > @ 0x93d8e4ec (unknown) at ??:0 > {code} > The attachment is the full test log -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-2453) kudu should stop creating tablet infinitely
[ https://issues.apache.org/jira/browse/KUDU-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110649#comment-17110649 ] Alexey Serbin commented on KUDU-2453: - There is a reproduction scenario for the issue described in this JIRA: https://gerrit.cloudera.org/#/c/15912/ > kudu should stop creating tablet infinitely > --- > > Key: KUDU-2453 > URL: https://issues.apache.org/jira/browse/KUDU-2453 > Project: Kudu > Issue Type: Bug > Components: master, tserver >Affects Versions: 1.4.0, 1.7.2 >Reporter: LiFu He >Priority: Major > > I have met this problem again on 2018/10/26. And now the kudu version is > 1.7.2. > - > We modified the flag 'max_create_tablets_per_ts' (2000) of master.conf, and > there was some load on the Kudu cluster. Then someone else created a big > table which had tens of thousands of tablets from impala-shell (that was a > mistake). > {code:java} > CREATE TABLE XXX( > ... >PRIMARY KEY (...) > ) > PARTITION BY HASH (...) PARTITIONS 100, > RANGE (...) > ( > PARTITION "2018-10-24" <= VALUES < "2018-10-24\000", > PARTITION "2018-10-25" <= VALUES < "2018-10-25\000", > ... > PARTITION "2018-12-07" <= VALUES < "2018-12-07\000" > ) > STORED AS KUDU > TBLPROPERTIES ('kudu.master_addresses'= '...'); > {code} > Here are the logs after creating the table (picking only one tablet as an example): > {code:java} > --Kudu-master log > ==e884bda6bbd3482f94c07ca0f34f99a4== > W1024 11:40:51.914397 180146 catalog_manager.cc:2664] TS > 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): Create Tablet RPC > failed for tablet e884bda6bbd3482f94c07ca0f34f99a4: Remote error: Service > unavailable: CreateTablet request on kudu.tserver.TabletServerAdminService > from 10.120.219.118:50247 dropped due to backpressure. The service queue is > full; it has 512 items. > I1024 11:40:51.914412 180146 catalog_manager.cc:2700] Scheduling retry of > CreateTablet RPC for tablet e884bda6bbd3482f94c07ca0f34f99a4 on TS > 39f15fcf42ef45bba0c95a3223dc25ee with a delay of 42 ms (attempt = 1) > ... > ==Be replaced by 0b144c00f35d48cca4d4981698faef72== > W1024 11:41:22.114512 180202 catalog_manager.cc:3949] T > P f6c9a09da7ef4fc191cab6276b942ba3: Tablet > e884bda6bbd3482f94c07ca0f34f99a4 (table quasi_realtime_user_feature > [id=946d6dd03ec544eab96231e5a03bed59]) was not created within the allowed > timeout. Replacing with a new tablet 0b144c00f35d48cca4d4981698faef72 > ... > I1024 11:41:22.391916 180202 catalog_manager.cc:3806] T > P f6c9a09da7ef4fc191cab6276b942ba3: Sending > DeleteTablet for 3 replicas of tablet e884bda6bbd3482f94c07ca0f34f99a4 > ... > I1024 11:41:22.391927 180202 catalog_manager.cc:2922] Sending > DeleteTablet(TABLET_DATA_DELETED) for tablet e884bda6bbd3482f94c07ca0f34f99a4 > on 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050) (Replaced by > 0b144c00f35d48cca4d4981698faef72 at 2018-10-24 11:41:22 CST) > ... > W1024 11:41:22.428129 180146 catalog_manager.cc:2892] TS > 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): delete failed for > tablet e884bda6bbd3482f94c07ca0f34f99a4 with error code TABLET_NOT_RUNNING: > Already present: State transition of tablet e884bda6bbd3482f94c07ca0f34f99a4 > already in progress: creating tablet > ... > I1024 11:41:22.428143 180146 catalog_manager.cc:2700] Scheduling retry of > e884bda6bbd3482f94c07ca0f34f99a4 Delete Tablet RPC for > TS=39f15fcf42ef45bba0c95a3223dc25ee with a delay of 35 ms (attempt = 1) > ... 
> W1024 11:41:22.683702 180145 catalog_manager.cc:2664] TS > b251540e606b4863bb576091ff961892 (kudu1.lt.163.org:7050): Create Tablet RPC > failed for tablet 0b144c00f35d48cca4d4981698faef72: Remote error: Service > unavailable: CreateTablet request on kudu.tserver.TabletServerAdminService > from 10.120.219.118:59735 dropped due to backpressure. The service queue is > full; it has 512 items. > I1024 11:41:22.683717 180145 catalog_manager.cc:2700] Scheduling retry of > CreateTablet RPC for tablet 0b144c00f35d48cca4d4981698faef72 on TS > b251540e606b4863bb576091ff961892 with a delay of 46 ms (attempt = 1) > ... > ==Be replaced by c0e0acc448fc42fc9e48f5025b112a75== > W1024 11:41:52.775420 180202 catalog_manager.cc:3949] T > P f6c9a09da7ef4fc191cab6276b942ba3: Tablet > 0b144c00f35d48cca4d4981698faef72 (table quasi_realtime_user_feature > [id=946d6dd03ec544eab96231e5a03bed59]) was not created within the allowed > timeout. Replacing with a new tablet c0e0acc448fc42fc9e48f5025b112a75 > ... > --Kudu-tserver log > I1024 11:40:52.014571 13
[jira] [Created] (KUDU-3124) A safer way to handle CreateTablet requests
Alexey Serbin created KUDU-3124: --- Summary: A safer way to handle CreateTablet requests Key: KUDU-3124 URL: https://issues.apache.org/jira/browse/KUDU-3124 Project: Kudu Issue Type: Improvement Components: master, tserver Affects Versions: 1.11.1, 1.11.0, 1.10.1, 1.10.0, 1.9.0, 1.7.1, 1.8.0, 1.7.0, 1.6.0, 1.5.0, 1.4.0, 1.3.1, 1.3.0, 1.2.0 Reporter: Alexey Serbin As of now, the catalog manager (a part of kudu-master) sends {{CreateTabletRequest}} RPCs as soon as they are realized by {{CatalogManager::ProcessPendingAssignments()}} when processing the list of deferred DDL operations, and at this level there aren't any restrictions on how many of those might be in flight or sent to a particular tablet server (NOTE: there is the {{\-\-max_create_tablets_per_ts}} flag, but it works at a higher level and only during the initial creation of a table). The {{CreateTablet}} requests are sent asynchronously, and if the tablet isn't created within {{\-\-tablet_creation_timeout_ms}} milliseconds, the catalog manager replaces all the tablet replicas, generating a new tablet UUID and sending corresponding {{CreateTabletRequest}} RPCs to a potentially different set of tablet servers. Corresponding {{DeleteTabletRequest}} RPCs (to remove the replicas of the stalled-during-creation tablet) are sent separately in an asynchronous way as well. There are at least two issues with this approach: # The {{\-\-max_create_tablets_per_ts}} threshold limits the number of concurrent requests hitting one tablet server only during the initial creation of a table. However, nothing limits how many requests to create a tablet replica might hit a tablet server when adding partitions to an existing table as a result of an ALTER TABLE request. # {{DeleteTabletRequest}} RPCs sometimes might not get into the RPC queues of the corresponding tablet servers, and the catalog manager stops retrying them after the {{\-\-unresponsive_ts_rpc_timeout_ms}} interval. This might spiral into a situation where requests to create replacement tablet replicas are passing through and being executed by tablet servers, but the corresponding requests to delete tablet replicas cannot get through because of queue overflows, with the catalog manager eventually giving up on retrying the latter. Eventually, tablet servers end up with a huge number of tablet replicas created, and they crash running out of memory. The crashed tablet servers cannot start after that because they run out of memory again trying to bootstrap the huge number of tablet replicas. See https://gerrit.cloudera.org/#/c/15912/ for the reproduction scenario and [KUDU-2453|https://issues.apache.org/jira/browse/KUDU-2453] for the corresponding issue reported some time ago. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3000) RemoteKsckTest.TestChecksumSnapshot sometimes fails
[ https://issues.apache.org/jira/browse/KUDU-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3000: Attachment: ksck_remote-test.01.txt.xz > RemoteKsckTest.TestChecksumSnapshot sometimes fails > --- > > Key: KUDU-3000 > URL: https://issues.apache.org/jira/browse/KUDU-3000 > Project: Kudu > Issue Type: Bug > Components: test >Affects Versions: 1.10.0, 1.10.1, 1.11.0 >Reporter: Alexey Serbin >Priority: Major > Attachments: ksck_remote-test.01.txt.xz, ksck_remote-test.txt.xz > > > The {{TestChecksumSnapshot}} scenario of the {{RemoteKsckTest}} test > sometimes fails with the following error message: > {noformat} > W1116 06:46:18.593114 3904 tablet_service.cc:2365] Rejecting scan request > for tablet 4ce9988aac744b1bbde2772c66cce35d: Uninitialized: safe time has not yet been initialized > src/kudu/tools/ksck_remote-test.cc:407: Failure > Failed > {noformat} > Full log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3000) RemoteKsckTest.TestChecksumSnapshot sometimes fails
[ https://issues.apache.org/jira/browse/KUDU-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17105098#comment-17105098 ] Alexey Serbin commented on KUDU-3000: - Another failure (probably this time the root cause is different): {noformat} /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/ksck_remote-test.cc:407: Failure Failed Bad status: Aborted: 1 errors were detected {noformat} The log is attached. [^ksck_remote-test.01.txt.xz] > RemoteKsckTest.TestChecksumSnapshot sometimes fails > --- > > Key: KUDU-3000 > URL: https://issues.apache.org/jira/browse/KUDU-3000 > Project: Kudu > Issue Type: Bug > Components: test >Affects Versions: 1.10.0, 1.10.1, 1.11.0 >Reporter: Alexey Serbin >Priority: Major > Attachments: ksck_remote-test.01.txt.xz, ksck_remote-test.txt.xz > > > The {{TestChecksumSnapshot}} scenario of the {{RemoteKsckTest}} test > sometimes fails with the following error message: > {noformat} > W1116 06:46:18.593114 3904 tablet_service.cc:2365] Rejecting scan request > for tablet 4ce9988aac744b1bbde2772c66cce35d: Uninitialized: safe time has not yet been initialized > src/kudu/tools/ksck_remote-test.cc:407: Failure > Failed > {noformat} > Full log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KUDU-3120) testHiveMetastoreIntegration(org.apache.kudu.test.TestMiniKuduCluster) sometimes fails with timeout
Alexey Serbin created KUDU-3120: --- Summary: testHiveMetastoreIntegration(org.apache.kudu.test.TestMiniKuduCluster) sometimes fails with timeout Key: KUDU-3120 URL: https://issues.apache.org/jira/browse/KUDU-3120 Project: Kudu Issue Type: Bug Components: test Reporter: Alexey Serbin Attachments: test-output.txt.xz The test named in the summary sometimes fails due to a timeout: {noformat} Time: 56.114 There was 1 failure: 1) testHiveMetastoreIntegration(org.apache.kudu.test.TestMiniKuduCluster) org.junit.runners.model.TestTimedOutException: test timed out after 5 milliseconds at java.io.FileInputStream.readBytes(Native Method) at java.io.FileInputStream.read(FileInputStream.java:255) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read(BufferedInputStream.java:265) at java.io.DataInputStream.readInt(DataInputStream.java:387) at org.apache.kudu.test.cluster.MiniKuduCluster.sendRequestToCluster(MiniKuduCluster.java:162) at org.apache.kudu.test.cluster.MiniKuduCluster.start(MiniKuduCluster.java:235) at org.apache.kudu.test.cluster.MiniKuduCluster.access$300(MiniKuduCluster.java:72) at org.apache.kudu.test.cluster.MiniKuduCluster$MiniKuduClusterBuilder.build(MiniKuduCluster.java:697) at org.apache.kudu.test.TestMiniKuduCluster.testHiveMetastoreIntegration(TestMiniKuduCluster.java:106) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.lang.Thread.run(Thread.java:748) {noformat} The log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KUDU-3119) ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN
Alexey Serbin created KUDU-3119: --- Summary: ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN Key: KUDU-3119 URL: https://issues.apache.org/jira/browse/KUDU-3119 Project: Kudu Issue Type: Bug Components: CLI, test Reporter: Alexey Serbin Attachments: kudu-tool-test.log.xz Sometimes the {{TestFsAddRemoveDataDirEndToEnd}} scenario of the {{ToolTest}} reports races for TSAN builds: {noformat} /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:266: Failure Failed Bad status: Runtime error: /tmp/dist-test-taskIZqSmU/build/tsan/bin/kudu: process exited with non-zero status 66 Google Test trace: /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:265: W0506 17:56:02.744191 4432 flags.cc:404] Enabled unsafe flag: --never_fsync=true I0506 17:56:02.780252 4432 fs_manager.cc:263] Metadata directory not provided I0506 17:56:02.780442 4432 fs_manager.cc:269] Using write-ahead log directory (fs_wal_dir) as metadata directory I0506 17:56:02.789638 4432 fs_manager.cc:399] Time spent opening directory manager: real 0.007s user 0.005s sys 0.002s I0506 17:56:02.789986 4432 env_posix.cc:1676] Not raising this process' open files per process limit of 1048576; it is already as high as it can go I0506 17:56:02.790426 4432 file_cache.cc:465] Constructed file cache lbm with capacity 419430 == WARNING: ThreadSanitizer: data race (pid=4432) ... {noformat} The log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3117) TServerQuiescingITest.TestMajorityQuiescingElectsLeader sometimes fails
[ https://issues.apache.org/jira/browse/KUDU-3117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3117: Description: The {{TServerQuiescingITest.TestMajorityQuiescingElectsLeader}} test scenario sometimes fails (TSAN builds) with messages like the ones below: {noformat} kudu/integration-tests/tablet_server_quiescing-itest.cc:228: Failure Failed Bad status: Timed out: Unable to find leader of tablet 65dba49fff3f45678638414af2fe83e4 after 10.005s. Status message: Not found: Error connecting to replica: Timed out: GetConsensusState RPC to 127.0.177.65:42397 timed out after -0.003s (SENT) kudu/util/test_util.cc:349: Failure Failed {noformat} The log is attached. was: The {{TServerQuiescingITest.TestMajorityQuiescingElectsLeader}} test scenario sometimes fails (TSAN builds) with messages like the ones below: {noformat} kudu/integration-tests/tablet_server_quiescing-itest.cc:228: Failure Failed Bad status: Timed out: Unable to find leader of tablet 65dba49fff3f45678638414af2fe83e4 after 10.005s. Status message: Not found: Error connecting to replica: Timed out: GetConsensusState RPC to 127.0.177.65:42397 timed out after -0.003s (SENT) kudu/util/test_util.cc:349: Failure Failed {noformat} > TServerQuiescingITest.TestMajorityQuiescingElectsLeader sometimes fails > --- > > Key: KUDU-3117 > URL: https://issues.apache.org/jira/browse/KUDU-3117 > Project: Kudu > Issue Type: Bug > Components: test >Affects Versions: 1.12.0 >Reporter: Alexey Serbin >Priority: Minor > Attachments: tablet_server_quiescing-itest.txt.xz > > > The {{TServerQuiescingITest.TestMajorityQuiescingElectsLeader}} test scenario > sometimes fails (TSAN builds) with messages like the ones below: > {noformat} > kudu/integration-tests/tablet_server_quiescing-itest.cc:228: Failure > Failed > > Bad status: Timed out: Unable to find leader of tablet > 65dba49fff3f45678638414af2fe83e4 after 10.005s. Status message: Not found: > Error connecting to replica: Timed out: GetConsensusState RPC to > 127.0.177.65:42397 timed out after -0.003s (SENT) > kudu/util/test_util.cc:349: Failure > Failed > {noformat} > The log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3117) TServerQuiescingITest.TestMajorityQuiescingElectsLeader sometimes fails
[ https://issues.apache.org/jira/browse/KUDU-3117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3117: Attachment: tablet_server_quiescing-itest.txt.xz > TServerQuiescingITest.TestMajorityQuiescingElectsLeader sometimes fails > --- > > Key: KUDU-3117 > URL: https://issues.apache.org/jira/browse/KUDU-3117 > Project: Kudu > Issue Type: Bug > Components: test >Affects Versions: 1.12.0 >Reporter: Alexey Serbin >Priority: Minor > Attachments: tablet_server_quiescing-itest.txt.xz > > > The {{TServerQuiescingITest.TestMajorityQuiescingElectsLeader}} test scenario > sometimes fails (TSAN builds) with messages like the ones below: > {noformat} > kudu/integration-tests/tablet_server_quiescing-itest.cc:228: Failure > Failed > > Bad status: Timed out: Unable to find leader of tablet > 65dba49fff3f45678638414af2fe83e4 after 10.005s. Status message: Not found: > Error connecting to replica: Timed out: GetConsensusState RPC to > 127.0.177.65:42397 timed out after -0.003s (SENT) > kudu/util/test_util.cc:349: Failure > Failed > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KUDU-3117) TServerQuiescingITest.TestMajorityQuiescingElectsLeader sometimes fails
Alexey Serbin created KUDU-3117: --- Summary: TServerQuiescingITest.TestMajorityQuiescingElectsLeader sometimes fails Key: KUDU-3117 URL: https://issues.apache.org/jira/browse/KUDU-3117 Project: Kudu Issue Type: Bug Components: test Affects Versions: 1.12.0 Reporter: Alexey Serbin The {{TServerQuiescingITest.TestMajorityQuiescingElectsLeader}} test scenario sometimes fails (TSAN builds) with messages like the ones below: {noformat} kudu/integration-tests/tablet_server_quiescing-itest.cc:228: Failure Failed Bad status: Timed out: Unable to find leader of tablet 65dba49fff3f45678638414af2fe83e4 after 10.005s. Status message: Not found: Error connecting to replica: Timed out: GetConsensusState RPC to 127.0.177.65:42397 timed out after -0.003s (SENT) kudu/util/test_util.cc:349: Failure Failed {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KUDU-3115) Improve scalability of Kudu masters
Alexey Serbin created KUDU-3115: --- Summary: Improve scalability of Kudu masters Key: KUDU-3115 URL: https://issues.apache.org/jira/browse/KUDU-3115 Project: Kudu Issue Type: Improvement Reporter: Alexey Serbin Currently, multiple masters in a multi-master Kudu cluster are used only for high availability & fault tolerance use cases, but not for sharing the load among the available master nodes. For example, Kudu clients detect the current leader master upon connecting to the cluster and send all their subsequent requests to the leader master, so serving many more clients requires running masters on more powerful nodes. The current design assumes that masters store and process requests for metadata only, but that makes sense only up to some limit on the rate of incoming client requests. It would be great to achieve better 'horizontal' scalability for Kudu masters. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3114) tserver writes core dump when reporting 'out of space'
[ https://issues.apache.org/jira/browse/KUDU-3114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099242#comment-17099242 ] Alexey Serbin commented on KUDU-3114: - Right, it's possible to disable coredumps for Kudu processes by adding {{\-\-disable_core_dumps}} even if the limit for core file size is set to non-zero. My point was that enabling/disabling coredumps per {{LOG(FATAL)}} instance is not feasible. Dumping a core file might make sense when troubleshooting an issue: e.g., if there is a bug in computing the number of bytes to allocate, or to see what event triggered the issue if an unexpectedly high amount of space is requested, etc. Probably, we can keep that for DEBUG builds only. I'm OK with keeping this JIRA item open (so, I'm re-opening it). Feel free to submit a patch to address the issue as needed. > tserver writes core dump when reporting 'out of space' > -- > > Key: KUDU-3114 > URL: https://issues.apache.org/jira/browse/KUDU-3114 > Project: Kudu > Issue Type: Bug > Components: tserver >Affects Versions: 1.7.1 >Reporter: Balazs Jeszenszky >Priority: Major > Fix For: n/a > > > Fatal log has: > {code} > F0503 23:56:27.359544 40012 status_callback.cc:35] Enqueued commit operation > failed to write to WAL: IO error: Insufficient disk space to allocate 8388608 > bytes under path (39973171200 bytes available vs 39988335247 bytes > reserved) (error 28) > {code} > Generating a core file in this case yields no benefit, and potentially > compounds the problem. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (KUDU-3114) tserver writes core dump when reporting 'out of space'
[ https://issues.apache.org/jira/browse/KUDU-3114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin reopened KUDU-3114: - > tserver writes core dump when reporting 'out of space' > -- > > Key: KUDU-3114 > URL: https://issues.apache.org/jira/browse/KUDU-3114 > Project: Kudu > Issue Type: Bug > Components: tserver >Affects Versions: 1.7.1 >Reporter: Balazs Jeszenszky >Priority: Major > Fix For: n/a > > > Fatal log has: > {code} > F0503 23:56:27.359544 40012 status_callback.cc:35] Enqueued commit operation > failed to write to WAL: IO error: Insufficient disk space to allocate 8388608 > bytes under path (39973171200 bytes available vs 39988335247 bytes > reserved) (error 28) > {code} > Generating a core file in this case yields no benefit, and potentially > compounds the problem. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KUDU-3114) tserver writes core dump when reporting 'out of space'
[ https://issues.apache.org/jira/browse/KUDU-3114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin resolved KUDU-3114. - Fix Version/s: n/a Resolution: Information Provided > tserver writes core dump when reporting 'out of space' > -- > > Key: KUDU-3114 > URL: https://issues.apache.org/jira/browse/KUDU-3114 > Project: Kudu > Issue Type: Bug > Components: tserver >Affects Versions: 1.7.1 >Reporter: Balazs Jeszenszky >Priority: Major > Fix For: n/a > > > Fatal log has: > {code} > F0503 23:56:27.359544 40012 status_callback.cc:35] Enqueued commit operation > failed to write to WAL: IO error: Insufficient disk space to allocate 8388608 > bytes under path (39973171200 bytes available vs 39988335247 bytes > reserved) (error 28) > {code} > Generating a core file in this case yields no benefit, and potentially > compounds the problem. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3114) tserver writes core dump when reporting 'out of space'
[ https://issues.apache.org/jira/browse/KUDU-3114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099075#comment-17099075 ] Alexey Serbin commented on KUDU-3114: - Thank you for reporting the issue. The way fatal inconsistencies are handled in Kudu doesn't provide a way to choose the coredump behavior per call site. That behavior is controlled at a different level: the environment the Kudu processes run in (check {{ulimit -c}}). As a good operational practice, it's advised to separate the location for core files (some directory on the system partition/volume) from the directories where Kudu stores its data and WAL. Also, consider [enabling mini-dumps in Kudu|https://kudu.apache.org/docs/troubleshooting.html#crash_reporting] and disabling core files if dumping cores isn't feasible due to space limitations (a sketch follows below). > tserver writes core dump when reporting 'out of space' > -- > > Key: KUDU-3114 > URL: https://issues.apache.org/jira/browse/KUDU-3114 > Project: Kudu > Issue Type: Bug > Components: tserver >Affects Versions: 1.7.1 >Reporter: Balazs Jeszenszky >Priority: Major > > Fatal log has: > {code} > F0503 23:56:27.359544 40012 status_callback.cc:35] Enqueued commit operation > failed to write to WAL: IO error: Insufficient disk space to allocate 8388608 > bytes under path (39973171200 bytes available vs 39988335247 bytes > reserved) (error 28) > {code} > Generating a core file in this case yields no benefit, and potentially > compounds the problem. -- This message was sent by Atlassian Jira (v8.3.4#803005)
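For reference, a sketch of the operational practice suggested above; {{\-\-disable_core_dumps}} is the flag mentioned elsewhere in this thread, and the rest is standard shell setup:
{noformat}
# Shell-level: no core files at all for processes started from this shell.
ulimit -c 0
# Process-level: keep core dumps off even where the ulimit allows them;
# consider enabling mini-dumps instead (see the troubleshooting doc above).
kudu-tserver --disable_core_dumps <other flags ...>
{noformat}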
[jira] [Commented] (KUDU-3107) TestRpc.TestCancellationMultiThreads fail on ARM sometimes due to service queue is full
[ https://issues.apache.org/jira/browse/KUDU-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096864#comment-17096864 ] Alexey Serbin commented on KUDU-3107: - I think the problem is that the code doesn't do a proper conversion of the RPC-level status code into the application-level status code. I think the following is missing: {noformat} if (controller.status().IsRemoteError()) { const ErrorStatusPB* err = rpc->error_response(); CHECK(err && err->has_code() && (err->code() == ErrorStatusPB::ERROR_SERVER_TOO_BUSY || err->code() == ErrorStatusPB::ERROR_UNAVAILABLE)); } {noformat} > TestRpc.TestCancellationMultiThreads fail on ARM sometimes due to service > queue is full > --- > > Key: KUDU-3107 > URL: https://issues.apache.org/jira/browse/KUDU-3107 > Project: Kudu > Issue Type: Sub-task >Reporter: liusheng >Priority: Major > Attachments: rpc-test.txt > > > The test TestRpc.TestCancellationMultiThreads fails sometimes on an ARM machine > due to the "service queue full" error; related error message: > {code:java} > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 318) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 319) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 320) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 321) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 324) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 332) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 334) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 335) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 336) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 337) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 338) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 339) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 340) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 341) > F0416 13:01:38.616358 31937 rpc-test.cc:1471] Check failed: > controller.status().IsAborted() || controller.status().IsServiceUnavailable() > || controller.status().ok() Remote error: Service unavailable: PushStrings > request on kudu.rpc.GenericCalculatorService from 127.0.0.1:41516 dropped due > to backpressure. The service queue is full; it has 100 items. 
> *** Check failure stack trace: *** > PC: @0x0 (unknown) > *** SIGABRT (@0x3e86bbf) received by PID 27583 (TID 0x84b1f050) from > PID 27583; stack trace: *** > @ 0x93cf0464 raise at ??:0 > @ 0x93cf18b4 abort at ??:0 > @ 0x942c5fdc google::logging_fail() at ??:0 > @ 0x942c7d40 google::LogMessage::Fail() at ??:0 > @ 0x942c9c78 google::LogMessage::SendToLog() at ??:0 > @ 0x942c7874 google::LogMessage::Flush() at ??:0 > @ 0x942ca4fc google::LogMessageFatal::~LogMessageFatal() at ??:0 > @ 0xdcee4940 kudu::rpc::SendAndCancelRpcs() at ??:0 > @ 0xdcee4b98 > _ZZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvENKUlvE_clEv > at ??:0 > @ 0xdcee76bc > _ZSt13__invoke_implIvZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_JEET_St14__invoke_otherOT0_DpOT1_ > at ??:0 > @ 0xdcee7484 > _ZSt8__invokeIZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS5_DpOS6_ > at ??:0 > @ 0xdcee8208 > _ZNSt6thread8_InvokerISt5tupleIJZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_EEE9_M_invokeIJLm0DTcl8__invokespcl10_S_declvalIXT_ESt12_Index_tupleIJXspT_EEE > at ??:0 > @ 0xdcee8168 > _ZNSt6thread8_InvokerISt5tupleIJZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_EEEclEv > at ??:0 > @ 0xdcee8110 > _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_E6_M_runEv > at ??:0 > @ 0x93f22e94 (unknown) at ??:0 > @ 0x93e1e088 start_thread at ??:0 > @ 0x93d8e4ec (unknown) at ??:0 > {code} > The attachment is the full test log -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle
[ https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3106: Summary: getEndpointChannelBindings() isn't working as expected with BouncyCastle (was: getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65) > getEndpointChannelBindings() isn't working as expected with BouncyCastle > > > Key: KUDU-3106 > URL: https://issues.apache.org/jira/browse/KUDU-3106 > Project: Kudu > Issue Type: Bug > Components: client, java, security >Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, > 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1 >Reporter: Alexey Serbin >Assignee: Alexey Serbin >Priority: Major > Fix For: 1.12.0 > > > With [BouncyCastle|https://www.bouncycastle.org] 1.65 the code in > https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159 > isn't working as expected throwing an exception: > {noformat} > java.lang.RuntimeException: cert uses unknown signature algorithm: > SHA256WITHRSA > {noformat} > It seems BouncyCastle 1.65 converts the name of the certificate signature > algorithm uppercase. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KUDU-3111) IWYU processes freestanding headers
Alexey Serbin created KUDU-3111: --- Summary: IWYU processes freestanding headers Key: KUDU-3111 URL: https://issues.apache.org/jira/browse/KUDU-3111 Project: Kudu Issue Type: Improvement Affects Versions: 1.11.1, 1.11.0, 1.10.1, 1.10.0, 1.9.0, 1.8.0, 1.7.0, 1.12.0 Reporter: Alexey Serbin When working out of the compilation database, IWYU processes only associated headers, i.e. {{.h}} files that pair with corresponding {{.cc}} files (a sketch of a typical compilation-database-driven run follows below). It would be nice to make IWYU process so-called freestanding header files as well. [This thread|https://github.com/include-what-you-use/include-what-you-use/issues/268] contains very useful information on the topic. -- This message was sent by Atlassian Jira (v8.3.4#803005)
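For context, a sketch of how IWYU is typically driven from a compilation database (the build directory and source paths are placeholders); the limitation described above is that only translation units with paired headers get full coverage this way:
{noformat}
# Run IWYU over the translation units listed in build/debug/compile_commands.json.
python iwyu_tool.py -p build/debug src/kudu/util/
{noformat}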
[jira] [Updated] (KUDU-3111) Make IWYU processes freestanding headers
[ https://issues.apache.org/jira/browse/KUDU-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3111: Summary: Make IWYU processes freestanding headers (was: IWYU processes freestanding headers ) > Make IWYU processes freestanding headers > > > Key: KUDU-3111 > URL: https://issues.apache.org/jira/browse/KUDU-3111 > Project: Kudu > Issue Type: Improvement >Affects Versions: 1.7.0, 1.8.0, 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.12.0, > 1.11.1 >Reporter: Alexey Serbin >Priority: Major > > When working out of the compilation database, IWYU processes only associated > headers, i.e. {{.h}} files that pair with corresponding {{.cc}} files. It would > be nice to make IWYU process so-called freestanding header files. [This > thread|https://github.com/include-what-you-use/include-what-you-use/issues/268] > contains very useful information on the topic. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3007) ARM/aarch64 platform support
[ https://issues.apache.org/jira/browse/KUDU-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091778#comment-17091778 ] Alexey Serbin commented on KUDU-3007: - Yes, I'm planning to take a closer look this weekend. Thank you for the contribution! > ARM/aarch64 platform support > > > Key: KUDU-3007 > URL: https://issues.apache.org/jira/browse/KUDU-3007 > Project: Kudu > Issue Type: Improvement >Reporter: liusheng >Priority: Critical > > As an important alternative to the x86 architecture, AArch64 (ARM) is > currently the dominant architecture in small devices like phones, IoT devices, > security cameras, drones, etc. Also, more and more hardware and cloud vendors > have started to provide ARM resources, such as AWS, Huawei, Packet, > Ampere, etc. Usually, ARM servers are lower-cost and cheaper than x86 > servers, and nowadays more and more ARM servers have performance comparable with > x86 servers, and are even more efficient in some areas. > We propose to add an AArch64 CI for Kudu to promote the support for > Kudu on AArch64 platforms. We are willing to provide machines for the current > CI system and manpower for managing the CI and fixing problems that occur. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KUDU-2986) Incorrect value for the 'live_row_count' metric with pre-1.11.0 tables
[ https://issues.apache.org/jira/browse/KUDU-2986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin resolved KUDU-2986. - Fix Version/s: 1.12.0 Resolution: Fixed > Incorrect value for the 'live_row_count' metric with pre-1.11.0 tables > -- > > Key: KUDU-2986 > URL: https://issues.apache.org/jira/browse/KUDU-2986 > Project: Kudu > Issue Type: Bug > Components: CLI, client, master, metrics >Affects Versions: 1.11.0 >Reporter: YifanZhang >Assignee: LiFu He >Priority: Major > Fix For: 1.12.0 > > > When we upgraded a cluster with pre-1.11.0 tables, we got inconsistent > values for the 'live_row_count' metric for these tables: > When visiting masterURL:port/metrics, we got 0 for old tables, and got a > positive integer for an old table with a newly added partition, which is the > count of rows in the newly added partition. > When getting table statistics via the `kudu table statistics` CLI tool, we got 0 > for old tables and for the old table with a new partition. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65
[ https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3106: Fix Version/s: 1.12.0 Resolution: Fixed Status: Resolved (was: In Review) > getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65 > - > > Key: KUDU-3106 > URL: https://issues.apache.org/jira/browse/KUDU-3106 > Project: Kudu > Issue Type: Bug > Components: client, java, security >Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, > 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1 >Reporter: Alexey Serbin >Assignee: Alexey Serbin >Priority: Major > Fix For: 1.12.0 > > > With [BouncyCastle|https://www.bouncycastle.org] 1.65 the code in > https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159 > isn't working as expected, throwing an exception: > {noformat} > java.lang.RuntimeException: cert uses unknown signature algorithm: > SHA256WITHRSA > {noformat} > It seems BouncyCastle 1.65 converts the name of the certificate signature > algorithm to uppercase. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65
[ https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3106: Description: With [BouncyCastle|https://www.bouncycastle.org] 1.65 the code in https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159 isn't working as expected throwing an exception: {noformat} java.lang.RuntimeException: cert uses unknown signature algorithm: SHA256WITHRSA {noformat} It seems BouncyCastle 1.65 converts the name of the certificate signature algorithm uppercase. was: With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159 isn't working as expected throwing an exception: {noformat} java.lang.RuntimeException: cert uses unknown signature algorithm: SHA256WITHRSA {noformat} It seems BouncyCastle 1.65 converts the name of the certificate signature algorithm uppercase. > getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65 > - > > Key: KUDU-3106 > URL: https://issues.apache.org/jira/browse/KUDU-3106 > Project: Kudu > Issue Type: Bug > Components: client, java, security >Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, > 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1 >Reporter: Alexey Serbin >Assignee: Alexey Serbin >Priority: Major > Fix For: 1.12.0 > > > With [BouncyCastle|https://www.bouncycastle.org] 1.65 the code in > https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159 > isn't working as expected throwing an exception: > {noformat} > java.lang.RuntimeException: cert uses unknown signature algorithm: > SHA256WITHRSA > {noformat} > It seems BouncyCastle 1.65 converts the name of the certificate signature > algorithm uppercase. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65
[ https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3106: Status: In Review (was: In Progress) > getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65 > - > > Key: KUDU-3106 > URL: https://issues.apache.org/jira/browse/KUDU-3106 > Project: Kudu > Issue Type: Bug > Components: client, java, security >Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, > 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1 >Reporter: Alexey Serbin >Assignee: Alexey Serbin >Priority: Major > > With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in > https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159 > isn't working as expected throwing an exception: > {noformat} > java.lang.RuntimeException: cert uses unknown signature algorithm: > SHA256WITHRSA > {noformat} > It seems BouncyCastle 1.65 converts the name of the certificate signature > algorithm uppercase. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65
[ https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3106: Code Review: http://gerrit.cloudera.org:8080/15664 > getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65 > - > > Key: KUDU-3106 > URL: https://issues.apache.org/jira/browse/KUDU-3106 > Project: Kudu > Issue Type: Bug > Components: client, java, security >Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, > 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1 >Reporter: Alexey Serbin >Assignee: Alexey Serbin >Priority: Major > > With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in > https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159 > isn't working as expected throwing an exception: > {noformat} > java.lang.RuntimeException: cert uses unknown signature algorithm: > SHA256WITHRSA > {noformat} > It seems BouncyCastle 1.65 converts the name of the certificate signature > algorithm uppercase. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65
[ https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin reassigned KUDU-3106: --- Assignee: Alexey Serbin > getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65 > - > > Key: KUDU-3106 > URL: https://issues.apache.org/jira/browse/KUDU-3106 > Project: Kudu > Issue Type: Bug > Components: client, java, security >Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, > 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1 >Reporter: Alexey Serbin >Assignee: Alexey Serbin >Priority: Major > > With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in > https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159 > isn't working as expected throwing an exception: > {noformat} > java.lang.RuntimeException: cert uses unknown signature algorithm: > SHA256WITHRSA > {noformat} > It seems BouncyCastle 1.65 converts the name of the certificate signature > algorithm uppercase. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65
[ https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3106: Summary: getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65 (was: getEndpointChannelBindings() isn't working as expected with BouncyCastle 2.65) > getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65 > - > > Key: KUDU-3106 > URL: https://issues.apache.org/jira/browse/KUDU-3106 > Project: Kudu > Issue Type: Bug > Components: client, java, security >Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, > 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1 >Reporter: Alexey Serbin >Priority: Major > > With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in > https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159 > isn't working as expected throwing an exception: > {noformat} > java.lang.RuntimeException: cert uses unknown signature algorithm: > SHA256WITHRSA > {noformat} > It seems BouncyCastle 2.65 converts the name of the certificate signature > algorithm uppercase. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65
[ https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3106: Description: With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159 isn't working as expected throwing an exception: {noformat} java.lang.RuntimeException: cert uses unknown signature algorithm: SHA256WITHRSA {noformat} It seems BouncyCastle 1.65 converts the name of the certificate signature algorithm uppercase. was: With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159 isn't working as expected throwing an exception: {noformat} java.lang.RuntimeException: cert uses unknown signature algorithm: SHA256WITHRSA {noformat} It seems BouncyCastle 2.65 converts the name of the certificate signature algorithm uppercase. > getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65 > - > > Key: KUDU-3106 > URL: https://issues.apache.org/jira/browse/KUDU-3106 > Project: Kudu > Issue Type: Bug > Components: client, java, security >Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, > 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1 >Reporter: Alexey Serbin >Priority: Major > > With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in > https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159 > isn't working as expected throwing an exception: > {noformat} > java.lang.RuntimeException: cert uses unknown signature algorithm: > SHA256WITHRSA > {noformat} > It seems BouncyCastle 1.65 converts the name of the certificate signature > algorithm uppercase. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 2.65
Alexey Serbin created KUDU-3106: --- Summary: getEndpointChannelBindings() isn't working as expected with BouncyCastle 2.65 Key: KUDU-3106 URL: https://issues.apache.org/jira/browse/KUDU-3106 Project: Kudu Issue Type: Bug Components: client, java, security Affects Versions: 1.11.1, 1.11.0, 1.10.1, 1.10.0, 1.9.0, 1.7.1, 1.8.0, 1.7.0, 1.6.0, 1.5.0, 1.4.0, 1.3.1, 1.3.0 Reporter: Alexey Serbin With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159 isn't working as expected throwing an exception: {noformat} java.lang.RuntimeException: cert uses unknown signature algorithm: SHA256WITHRSA {noformat} It seems BouncyCastle 2.65 converts the name of the certificate signature algorithm uppercase. -- This message was sent by Atlassian Jira (v8.3.4#803005)
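For context on the failure above: the affected Java code maps a certificate's signature algorithm name (e.g. {{SHA256withRSA}}) to the digest prescribed by RFC 5929 for {{tls-server-end-point}} channel bindings, and the lookup is case-sensitive, so the upper-cased {{SHA256WITHRSA}} returned by the newer BouncyCastle misses the table. Below is a minimal sketch of a case-insensitive lookup; the table is hypothetical and this is an illustration, not the actual {{SecurityUtil}} code.
{code:cpp}
#include <algorithm>
#include <cctype>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Illustration only (the affected code lives in the Java client's
// SecurityUtil): resolve the digest used for RFC 5929
// 'tls-server-end-point' channel bindings from a certificate's signature
// algorithm name, tolerating any letter case so that both "SHA256withRSA"
// and BouncyCastle 1.65's "SHA256WITHRSA" match.
std::string DigestForChannelBindings(std::string sig_alg) {
  std::transform(sig_alg.begin(), sig_alg.end(), sig_alg.begin(),
                 [](unsigned char c) { return std::toupper(c); });
  // Hypothetical table; per RFC 5929, MD5 and SHA-1 are upgraded to SHA-256.
  static const std::unordered_map<std::string, std::string> kDigests = {
    {"MD5WITHRSA",    "SHA-256"},
    {"SHA1WITHRSA",   "SHA-256"},
    {"SHA256WITHRSA", "SHA-256"},
    {"SHA384WITHRSA", "SHA-384"},
    {"SHA512WITHRSA", "SHA-512"},
  };
  const auto it = kDigests.find(sig_alg);
  if (it == kDigests.end()) {
    throw std::runtime_error(
        "cert uses unknown signature algorithm: " + sig_alg);
  }
  return it->second;
}
{code}
Normalizing the name before the lookup makes the mapping robust to either spelling without changing the set of accepted algorithms.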
[jira] [Commented] (KUDU-2573) Fully support Chrony in place of NTP
[ https://issues.apache.org/jira/browse/KUDU-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17076583#comment-17076583 ] Alexey Serbin commented on KUDU-2573: - With this [changelist|https://gerrit.cloudera.org/#/c/15456/], the necessary piece of the documentation will be in 1.12 release notes. > Fully support Chrony in place of NTP > > > Key: KUDU-2573 > URL: https://issues.apache.org/jira/browse/KUDU-2573 > Project: Kudu > Issue Type: New Feature > Components: clock, master, tserver >Reporter: Grant Henke >Assignee: Alexey Serbin >Priority: Major > Labels: clock > > This is to track fully supporting Chrony in place of NTP. Given Chrony is the > default in RHEL7+, running Kudu with Chrony is likely to be more common. > The work should entail: > * identifying and fixing or documenting any differences or gaps > * removing the experimental warnings from the documentation -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KUDU-2573) Fully support Chrony in place of NTP
[ https://issues.apache.org/jira/browse/KUDU-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin resolved KUDU-2573. - Fix Version/s: 1.12.0 Resolution: Fixed > Fully support Chrony in place of NTP > > > Key: KUDU-2573 > URL: https://issues.apache.org/jira/browse/KUDU-2573 > Project: Kudu > Issue Type: New Feature > Components: clock, master, tserver >Reporter: Grant Henke >Assignee: Alexey Serbin >Priority: Major > Labels: clock > Fix For: 1.12.0 > > > This is to track fully supporting Chrony in place of NTP. Given Chrony is the > default in RHEL7+, running Kudu with Chrony is likely to be more common. > The work should entail: > * identifying and fixing or documenting any differences or gaps > * removing the experimental warnings from the documentation -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-2798) Fix logging on deleted TSK entries
[ https://issues.apache.org/jira/browse/KUDU-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-2798: Affects Version/s: 1.10.1 1.11.0 1.11.1 > Fix logging on deleted TSK entries > -- > > Key: KUDU-2798 > URL: https://issues.apache.org/jira/browse/KUDU-2798 > Project: Kudu > Issue Type: Task >Affects Versions: 1.8.0, 1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.11.0, 1.11.1 >Reporter: Alexey Serbin >Assignee: Alexey Serbin >Priority: Minor > Labels: newbie > > It seems the identifiers of the deleted TSK entries in the log lines below > need decoding: > {noformat} > I0312 15:17:14.808763 71553 catalog_manager.cc:4095] T > P f05d759af7824df9aafedcc106674182: > Generated new TSK 2 > I0312 15:17:14.811144 71553 catalog_manager.cc:4133] T > P f05d759af7824df9aafedcc106674182: Deleted > TSKs: �, � > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-2798) Fix logging on deleted TSK entries
[ https://issues.apache.org/jira/browse/KUDU-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-2798: Code Review: https://gerrit.cloudera.org/#/c/15657/ > Fix logging on deleted TSK entries > -- > > Key: KUDU-2798 > URL: https://issues.apache.org/jira/browse/KUDU-2798 > Project: Kudu > Issue Type: Task >Affects Versions: 1.8.0, 1.9.0, 1.9.1, 1.10.0 >Reporter: Alexey Serbin >Assignee: Alexey Serbin >Priority: Minor > Labels: newbie > > It seems the identifiers of the deleted TSK entries in the log lines below > need decoding: > {noformat} > I0312 15:17:14.808763 71553 catalog_manager.cc:4095] T > P f05d759af7824df9aafedcc106674182: > Generated new TSK 2 > I0312 15:17:14.811144 71553 catalog_manager.cc:4133] T > P f05d759af7824df9aafedcc106674182: Deleted > TSKs: �, � > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KUDU-2798) Fix logging on deleted TSK entries
[ https://issues.apache.org/jira/browse/KUDU-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin reassigned KUDU-2798: --- Assignee: Alexey Serbin > Fix logging on deleted TSK entries > -- > > Key: KUDU-2798 > URL: https://issues.apache.org/jira/browse/KUDU-2798 > Project: Kudu > Issue Type: Task >Affects Versions: 1.8.0, 1.9.0, 1.9.1, 1.10.0 >Reporter: Alexey Serbin >Assignee: Alexey Serbin >Priority: Minor > Labels: newbie > > It seems the identifiers of the deleted TSK entries in the log lines below > need decoding: > {noformat} > I0312 15:17:14.808763 71553 catalog_manager.cc:4095] T > P f05d759af7824df9aafedcc106674182: > Generated new TSK 2 > I0312 15:17:14.811144 71553 catalog_manager.cc:4133] T > P f05d759af7824df9aafedcc106674182: Deleted > TSKs: �, � > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-2798) Fix logging on deleted TSK entries
[ https://issues.apache.org/jira/browse/KUDU-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-2798: Status: In Review (was: In Progress) > Fix logging on deleted TSK entries > -- > > Key: KUDU-2798 > URL: https://issues.apache.org/jira/browse/KUDU-2798 > Project: Kudu > Issue Type: Task >Affects Versions: 1.8.0, 1.9.0, 1.9.1, 1.10.0 >Reporter: Alexey Serbin >Assignee: Alexey Serbin >Priority: Minor > Labels: newbie > > It seems the identifiers of the deleted TSK entries in the log lines below > need decoding: > {noformat} > I0312 15:17:14.808763 71553 catalog_manager.cc:4095] T > P f05d759af7824df9aafedcc106674182: > Generated new TSK 2 > I0312 15:17:14.811144 71553 catalog_manager.cc:4133] T > P f05d759af7824df9aafedcc106674182: Deleted > TSKs: �, � > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KUDU-3105) kudu_client based application reports 'Locking callback not initialized' error
Alexey Serbin created KUDU-3105: --- Summary: kudu_client based application reports 'Locking callback not initialized' error Key: KUDU-3105 URL: https://issues.apache.org/jira/browse/KUDU-3105 Project: Kudu Issue Type: Bug Components: client, python, security Affects Versions: 1.11.1, 1.11.0, 1.10.1, 1.10.0, 1.9.0 Reporter: Alexey Serbin When using the kudu_client library compiled against OpenSSL 1.0.x with an OpenSSL 1.1.x run-time, Kudu client applications might report a 'Runtime error: Locking callback not initialized' error. For example, {{kudu-python}} based applications on RHEL/CentOS 7.7, if using {{kudu-client}} versions 1.9, 1.10, or 1.11 in a Python environment with OpenSSL 1.1.1d, might report an error like the one below: {noformat} Traceback (most recent call last): File "kudu-python-app.py", line 22, in <module> client = kudu.connect(host=args.masters, port=args.ports) File "/opt/lib/python3.6/site-packages/kudu/__init__.py", line 96, in connect rpc_timeout_ms=rpc_timeout_ms) File "kudu/client.pyx", line 297, in kudu.client.Client.__cinit__ File "kudu/errors.pyx", line 62, in kudu.errors.check_status kudu.errors.KuduBadStatus: b'Runtime error: Locking callback not initialized' {noformat} The issue is that {{libkudu_client}} code compiled against OpenSSL 1.0.x uses an initialization code path specific to OpenSSL 1.0.x, whose post-condition requires that thread-safety locking callbacks be installed once the initialization is done. However, those functions do not install the expected locking callbacks when running against OpenSSL 1.1.x: the callbacks are no longer required there because OpenSSL revamped its multi-threading model in version 1.1.0. -- This message was sent by Atlassian Jira (v8.3.4#803005)
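To make the version mismatch concrete: with OpenSSL 1.0.x a multi-threaded application must install locking callbacks itself, while in 1.1.x locking is internal and the 1.0.x-era setters are kept only as no-op compatibility stubs, so code that installs the callbacks and then insists they are present fails exactly as described. A minimal sketch of version-guarded initialization follows (an illustration, not the actual {{libkudu_client}} code).
{code:cpp}
#include <mutex>
#include <vector>

#include <openssl/crypto.h>
#include <openssl/ssl.h>

#if OPENSSL_VERSION_NUMBER < 0x10100000L
// OpenSSL 1.0.x: the application owns the lock array.
static std::vector<std::mutex>* g_ssl_mutexes = nullptr;  // leaked on purpose

static void LockingCB(int mode, int idx, const char* /*file*/, int /*line*/) {
  if (mode & CRYPTO_LOCK) {
    (*g_ssl_mutexes)[idx].lock();
  } else {
    (*g_ssl_mutexes)[idx].unlock();
  }
}
#endif

void InitOpenSSLThreading() {
#if OPENSSL_VERSION_NUMBER < 0x10100000L
  SSL_library_init();
  SSL_load_error_strings();
  // Install the locking callback only if nobody has yet. Note: when a
  // 1.0.x-built binary runs against a 1.1.x library, this setter is a
  // no-op compatibility stub, so a post-condition of "callback must now
  // be present" (which the affected code effectively had) cannot hold.
  if (CRYPTO_get_locking_callback() == nullptr) {
    g_ssl_mutexes = new std::vector<std::mutex>(CRYPTO_num_locks());
    CRYPTO_set_locking_callback(&LockingCB);
  }
#else
  // OpenSSL 1.1.0+ locks internally; no callbacks to install.
  OPENSSL_init_ssl(0, nullptr);
#endif
}
{code}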
[jira] [Assigned] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time
[ https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin reassigned KUDU-3082: --- Assignee: Alexey Serbin > tablets in "CONSENSUS_MISMATCH" state for a long time > - > > Key: KUDU-3082 > URL: https://issues.apache.org/jira/browse/KUDU-3082 > Project: Kudu > Issue Type: Bug > Components: consensus >Affects Versions: 1.10.1 >Reporter: YifanZhang >Assignee: Alexey Serbin >Priority: Major > Attachments: master_leader.log, ts25.info.gz, ts26.log.gz > > > Lately we found a few tablets in one of our clusters are unhealthy, the ksck > output is like: > > {code:java} > Tablet Summary > Tablet 7404240f458f462d92b6588d07583a52 of table '' is conflicted: 3 > replicas' active configs disagree with the leader master's > 7380d797d2ea49e88d71091802fb1c81 (kudu-ts26): RUNNING > d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING > 47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER] > All reported replicas are: > A = 7380d797d2ea49e88d71091802fb1c81 > B = d1952499f94a4e6087bee28466fcb09f > C = 47af52df1adc47e1903eb097e9c88f2e > D = 08beca5ed4d04003b6979bf8bac378d2 > The consensus matrix is: > Config source | Replicas | Current term | Config index | Committed? > ---+--+--+--+ > master| A B C* | | | Yes > A | A B C* | 5| -1 | Yes > B | A B C| 5| -1 | Yes > C | A B C* D~ | 5| 54649| No > Tablet 6d9d3fb034314fa7bee9cfbf602bcdc8 of table '' is conflicted: 2 > replicas' active configs disagree with the leader master's > d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING > 47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER] > 5a8aeadabdd140c29a09dabcae919b31 (kudu-ts21): RUNNING > All reported replicas are: > A = d1952499f94a4e6087bee28466fcb09f > B = 47af52df1adc47e1903eb097e9c88f2e > C = 5a8aeadabdd140c29a09dabcae919b31 > D = 14632cdbb0d04279bc772f64e06389f9 > The consensus matrix is: > Config source | Replicas | Current term | Config index | Committed? > ---+--+--+--+ > master| A B* C| | | Yes > A | A B* C| 5| 5| Yes > B | A B* C D~ | 5| 96176| No > C | A B* C| 5| 5| Yes > Tablet bf1ec7d693b94632b099dc0928e76363 of table '' is conflicted: 1 > replicas' active configs disagree with the leader master's > a9eaff3cf1ed483aae84954d649a (kudu-ts23): RUNNING > f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING > 47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER] > All reported replicas are: > A = a9eaff3cf1ed483aae84954d649a > B = f75df4a6b5ce404884313af5f906b392 > C = 47af52df1adc47e1903eb097e9c88f2e > D = d1952499f94a4e6087bee28466fcb09f > The consensus matrix is: > Config source | Replicas | Current term | Config index | Committed? > ---+--+--+--+ > master| A B C* | | | Yes > A | A B C* | 1| -1 | Yes > B | A B C* | 1| -1 | Yes > C | A B C* D~ | 1| 2| No > Tablet 3190a310857e4c64997adb477131488a of table '' is conflicted: 3 > replicas' active configs disagree with the leader master's > 47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER] > f0f7b2f4b9d344e6929105f48365f38e (kudu-ts24): RUNNING > f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING > All reported replicas are: > A = 47af52df1adc47e1903eb097e9c88f2e > B = f0f7b2f4b9d344e6929105f48365f38e > C = f75df4a6b5ce404884313af5f906b392 > D = d1952499f94a4e6087bee28466fcb09f > The consensus matrix is: > Config source | Replicas | Current term | Config index | Committed? 
> ---+--+--+--+ > master| A* B C| | | Yes > A | A* B C D~ | 1| 1991 | No > B | A* B C| 1| 4| Yes > C | A* B C| 1| 4| Yes{code} > These tablets couldn't recover for a couple of days until we restart > kudu-ts27. > I found so many duplicated logs in kudu-ts27 are like: > {code:java} > I0314 04:38:41.511279 65731 raft_con
[jira] [Assigned] (KUDU-3098) leadership change during tablet_copy process may lead to an isolated replica
[ https://issues.apache.org/jira/browse/KUDU-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin reassigned KUDU-3098: --- Assignee: Alexey Serbin > leadership change during tablet_copy process may lead to an isolated replica > --- > > Key: KUDU-3098 > URL: https://issues.apache.org/jira/browse/KUDU-3098 > Project: Kudu > Issue Type: Bug > Components: consensus, master >Affects Versions: 1.10.1 >Reporter: YifanZhang >Assignee: Alexey Serbin >Priority: Major > > Lately we found some tablets in a cluster with a very large > "time_since_last_leader_heartbeat" metric; they are LEARNER/NON_VOTER replicas and > seemingly couldn't become VOTER for a long time. > These replicas were created during the rebalance/tablet_copy process. After > beginning a new copy session from the leader to the newly added NON_VOTER peer, > leadership changed and the old leader aborted the uncommitted CHANGE_CONFIG_OP operation. > Finally the tablet_copy session ended, but the new leader knew nothing about the > new peer. > The master didn't delete this newly added replica because it had a larger > opid_index than the latest reported committed config. See the comments in > CatalogManager::ProcessTabletReport > {code:java} > // 5. Tombstone a replica that is no longer part of the Raft config (and > // not already tombstoned or deleted outright). > // > // If the report includes a committed raft config, we only tombstone if > // the opid_index is strictly less than the latest reported committed > // config. This prevents us from spuriously deleting replicas that have > // just been added to the committed config and are in the process of copying. > {code} > Maybe we shouldn't use opid_index to determine if replicas are in the process > of copying. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KUDU-3100) AutoRebalancerTest.TestHandlingFailedTservers sometimes fails
Alexey Serbin created KUDU-3100: --- Summary: AutoRebalancerTest.TestHandlingFailedTservers sometimes fails Key: KUDU-3100 URL: https://issues.apache.org/jira/browse/KUDU-3100 Project: Kudu Issue Type: Bug Components: test Affects Versions: 1.12.0 Reporter: Alexey Serbin Attachments: auto_rebalancer-test.txt.xz The {{AutoRebalancerTest.TestHandlingFailedTservers}} sometimes fails with the following error messages: {noformat} W0327 22:40:10.759768 6796 auto_rebalancer.cc:666] Could not move replica: Network error: Client connection negotiation failed: client connection to 127.2.102.194:33557: connect: Connection refused (error 111) /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/master/auto_rebalancer-test.cc:524: Failure Value of: matched Actual: false Expected: true not one string matched pattern scheduled replica move failed to complete: Network error {noformat} The log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3100) AutoRebalancerTest.TestHandlingFailedTservers sometimes fails
[ https://issues.apache.org/jira/browse/KUDU-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3100: Attachment: auto_rebalancer-test.txt.xz > AutoRebalancerTest.TestHandlingFailedTservers sometimes fails > - > > Key: KUDU-3100 > URL: https://issues.apache.org/jira/browse/KUDU-3100 > Project: Kudu > Issue Type: Bug > Components: test >Affects Versions: 1.12.0 >Reporter: Alexey Serbin >Priority: Major > Attachments: auto_rebalancer-test.txt.xz > > > The {{AutoRebalancerTest.TestHandlingFailedTservers}} sometimes fails with > the following error messages: > {noformat} > W0327 22:40:10.759768 6796 auto_rebalancer.cc:666] Could not move replica: > Network error: Client connection negotiation failed: client connection to > 127.2.102.194:33557: connect: Connection refused (error 111) > /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/master/auto_rebalancer-test.cc:524: > Failure > Value of: matched > > Actual: false > > Expected: true > > not one string matched pattern scheduled replica move failed to complete: > Network error > {noformat} > The log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KUDU-3095) RaftConsensusNonVoterITest.PromotedReplicaCanVote sometimes fails
Alexey Serbin created KUDU-3095: --- Summary: RaftConsensusNonVoterITest.PromotedReplicaCanVote sometimes fails Key: KUDU-3095 URL: https://issues.apache.org/jira/browse/KUDU-3095 Project: Kudu Issue Type: Bug Components: test Affects Versions: 1.12.0 Reporter: Alexey Serbin Attachments: raft_consensus_nonvoter-itest.txt.xz The {{RaftConsensusNonVoterITest.PromotedReplicaCanVote}} scenario sometimes fails with an error: {noformat} I0327 00:44:00.297801 4401 raft_consensus.cc:2810] T c2378cfec6604e0e813f43775107f2e6 P 4f6b943b18a649fabbd6cfb8d06ed20f [term 3 FOLLOWER]: CHANGE_CONFIG_OP replication failed: Aborted: Transaction aborted by new leader /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/integration-tests/raft_consensus_nonvoter-itest.cc:1079: Failure Failed {noformat} The log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KUDU-3094) KuduTest.TestWebUIDoesNotCrashCluster sometimes fails
Alexey Serbin created KUDU-3094: --- Summary: KuduTest.TestWebUIDoesNotCrashCluster sometimes fails Key: KUDU-3094 URL: https://issues.apache.org/jira/browse/KUDU-3094 Project: Kudu Issue Type: Bug Components: test Affects Versions: 1.12.0 Reporter: Alexey Serbin Attachments: webserver-stress-itest.txt.xz The {{KuduTest.TestWebUIDoesNotCrashCluster}} test scenario sometimes fails, timing out on creating the test table: {noformat} F0327 00:47:41.845475 361 test_workload.cc:329] Timed out: Timed out waiting for Table Creation {noformat} The log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time
[ https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068181#comment-17068181 ] Alexey Serbin commented on KUDU-3082: - [~zhangyifan27], do you have an idea what might lead to such a situation? Anything specific happened to the cluster? I'm trying to have a reproduction scenario for this. Any hint might be useful. Thanks! > tablets in "CONSENSUS_MISMATCH" state for a long time > - > > Key: KUDU-3082 > URL: https://issues.apache.org/jira/browse/KUDU-3082 > Project: Kudu > Issue Type: Bug > Components: consensus >Affects Versions: 1.10.1 >Reporter: YifanZhang >Priority: Major > > Lately we found a few tablets in one of our clusters are unhealthy, the ksck > output is like: > > {code:java} > Tablet Summary > Tablet 7404240f458f462d92b6588d07583a52 of table '' is conflicted: 3 > replicas' active configs disagree with the leader master's > 7380d797d2ea49e88d71091802fb1c81 (kudu-ts26): RUNNING > d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING > 47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER] > All reported replicas are: > A = 7380d797d2ea49e88d71091802fb1c81 > B = d1952499f94a4e6087bee28466fcb09f > C = 47af52df1adc47e1903eb097e9c88f2e > D = 08beca5ed4d04003b6979bf8bac378d2 > The consensus matrix is: > Config source | Replicas | Current term | Config index | Committed? > ---+--+--+--+ > master| A B C* | | | Yes > A | A B C* | 5| -1 | Yes > B | A B C| 5| -1 | Yes > C | A B C* D~ | 5| 54649| No > Tablet 6d9d3fb034314fa7bee9cfbf602bcdc8 of table '' is conflicted: 2 > replicas' active configs disagree with the leader master's > d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING > 47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER] > 5a8aeadabdd140c29a09dabcae919b31 (kudu-ts21): RUNNING > All reported replicas are: > A = d1952499f94a4e6087bee28466fcb09f > B = 47af52df1adc47e1903eb097e9c88f2e > C = 5a8aeadabdd140c29a09dabcae919b31 > D = 14632cdbb0d04279bc772f64e06389f9 > The consensus matrix is: > Config source | Replicas | Current term | Config index | Committed? > ---+--+--+--+ > master| A B* C| | | Yes > A | A B* C| 5| 5| Yes > B | A B* C D~ | 5| 96176| No > C | A B* C| 5| 5| Yes > Tablet bf1ec7d693b94632b099dc0928e76363 of table '' is conflicted: 1 > replicas' active configs disagree with the leader master's > a9eaff3cf1ed483aae84954d649a (kudu-ts23): RUNNING > f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING > 47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER] > All reported replicas are: > A = a9eaff3cf1ed483aae84954d649a > B = f75df4a6b5ce404884313af5f906b392 > C = 47af52df1adc47e1903eb097e9c88f2e > D = d1952499f94a4e6087bee28466fcb09f > The consensus matrix is: > Config source | Replicas | Current term | Config index | Committed? > ---+--+--+--+ > master| A B C* | | | Yes > A | A B C* | 1| -1 | Yes > B | A B C* | 1| -1 | Yes > C | A B C* D~ | 1| 2| No > Tablet 3190a310857e4c64997adb477131488a of table '' is conflicted: 3 > replicas' active configs disagree with the leader master's > 47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER] > f0f7b2f4b9d344e6929105f48365f38e (kudu-ts24): RUNNING > f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING > All reported replicas are: > A = 47af52df1adc47e1903eb097e9c88f2e > B = f0f7b2f4b9d344e6929105f48365f38e > C = f75df4a6b5ce404884313af5f906b392 > D = d1952499f94a4e6087bee28466fcb09f > The consensus matrix is: > Config source | Replicas | Current term | Config index | Committed? 
> ---+--+--+--+ > master| A* B C| | | Yes > A | A* B C D~ | 1| 1991 | No > B | A* B C| 1| 4| Yes > C | A* B C| 1| 4| Yes{code} > These tablets couldn't recover for a couple of days until we resta
[jira] [Updated] (KUDU-3087) Python tests failed on arm64
[ https://issues.apache.org/jira/browse/KUDU-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3087: Status: In Review (was: In Progress) > Python tests failed on arm64 > > > Key: KUDU-3087 > URL: https://issues.apache.org/jira/browse/KUDU-3087 > Project: Kudu > Issue Type: Sub-task >Reporter: huangtianhua >Assignee: Alexey Serbin >Priority: Major > Attachments: python_test.rar > > > I took python tests for kudu on arm64 platform based on > https://gerrit.cloudera.org/#/c/14964/ the tests failed, error info as > below: > W0323 02:54:39.938022 9110 negotiation.cc:313] Failed RPC negotiation. Trace: > 0323 02:54:39.936597 (+ 0us) reactor.cc:604] Submitting negotiation task > for client connection to 127.8.25.194:34669 > 0323 02:54:39.936737 (+ 140us) negotiation.cc:98] Waiting for socket to > connect > 0323 02:54:39.936746 (+ 9us) client_negotiation.cc:169] Beginning > negotiation > 0323 02:54:39.936810 (+64us) client_negotiation.cc:246] Sending NEGOTIATE > NegotiatePB request > 0323 02:54:39.937073 (+ 263us) client_negotiation.cc:263] Received > NEGOTIATE NegotiatePB response > 0323 02:54:39.937074 (+ 1us) client_negotiation.cc:357] Received > NEGOTIATE response from server > 0323 02:54:39.937079 (+ 5us) client_negotiation.cc:184] Negotiated > authn=TOKEN > 0323 02:54:39.937168 (+89us) client_negotiation.cc:473] Sending > TLS_HANDSHAKE message to server > 0323 02:54:39.937171 (+ 3us) client_negotiation.cc:246] Sending > TLS_HANDSHAKE NegotiatePB request > 0323 02:54:39.937724 (+ 553us) client_negotiation.cc:263] Received > TLS_HANDSHAKE NegotiatePB response > 0323 02:54:39.937726 (+ 2us) client_negotiation.cc:486] Received > TLS_HANDSHAKE response from server > 0323 02:54:39.937906 (+ 180us) negotiation.cc:304] Negotiation complete: > Runtime error: Client connection negotiation failed: client connection to > 127.8.25.194:34669: TLS Handshake error: error:1416F086:SSL > routines:tls_process_server_certificate:certificate verify > failed:../ssl/statem/statem_clnt.c:1924 > Metrics: > {"client-negotiator.queue_time_us":90,"thread_start_us":41,"threads_started":1} > The python tests were successful before the commit > https://github.com/apache/kudu/commit/3343144fefaad5a30e95e21297c64c78e308fa1f > and I tried to remove this commit based on master and then the python tests > are success, seems the problem introduced by > https://github.com/apache/kudu/commit/3343144fefaad5a30e95e21297c64c78e308fa1f, > but I am sorry I can't fix this, could someone help me?Thanks. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3087) Python tests failed on arm64
[ https://issues.apache.org/jira/browse/KUDU-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3087: Code Review: http://gerrit.cloudera.org:8080/15554 > Python tests failed on arm64 > > > Key: KUDU-3087 > URL: https://issues.apache.org/jira/browse/KUDU-3087 > Project: Kudu > Issue Type: Sub-task >Reporter: huangtianhua >Assignee: Alexey Serbin >Priority: Major > Attachments: python_test.rar > > > I took python tests for kudu on arm64 platform based on > https://gerrit.cloudera.org/#/c/14964/ the tests failed, error info as > below: > W0323 02:54:39.938022 9110 negotiation.cc:313] Failed RPC negotiation. Trace: > 0323 02:54:39.936597 (+ 0us) reactor.cc:604] Submitting negotiation task > for client connection to 127.8.25.194:34669 > 0323 02:54:39.936737 (+ 140us) negotiation.cc:98] Waiting for socket to > connect > 0323 02:54:39.936746 (+ 9us) client_negotiation.cc:169] Beginning > negotiation > 0323 02:54:39.936810 (+64us) client_negotiation.cc:246] Sending NEGOTIATE > NegotiatePB request > 0323 02:54:39.937073 (+ 263us) client_negotiation.cc:263] Received > NEGOTIATE NegotiatePB response > 0323 02:54:39.937074 (+ 1us) client_negotiation.cc:357] Received > NEGOTIATE response from server > 0323 02:54:39.937079 (+ 5us) client_negotiation.cc:184] Negotiated > authn=TOKEN > 0323 02:54:39.937168 (+89us) client_negotiation.cc:473] Sending > TLS_HANDSHAKE message to server > 0323 02:54:39.937171 (+ 3us) client_negotiation.cc:246] Sending > TLS_HANDSHAKE NegotiatePB request > 0323 02:54:39.937724 (+ 553us) client_negotiation.cc:263] Received > TLS_HANDSHAKE NegotiatePB response > 0323 02:54:39.937726 (+ 2us) client_negotiation.cc:486] Received > TLS_HANDSHAKE response from server > 0323 02:54:39.937906 (+ 180us) negotiation.cc:304] Negotiation complete: > Runtime error: Client connection negotiation failed: client connection to > 127.8.25.194:34669: TLS Handshake error: error:1416F086:SSL > routines:tls_process_server_certificate:certificate verify > failed:../ssl/statem/statem_clnt.c:1924 > Metrics: > {"client-negotiator.queue_time_us":90,"thread_start_us":41,"threads_started":1} > The python tests were successful before the commit > https://github.com/apache/kudu/commit/3343144fefaad5a30e95e21297c64c78e308fa1f > and I tried to remove this commit based on master and then the python tests > are success, seems the problem introduced by > https://github.com/apache/kudu/commit/3343144fefaad5a30e95e21297c64c78e308fa1f, > but I am sorry I can't fix this, could someone help me?Thanks. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3087) Python tests failed on arm64
[ https://issues.apache.org/jira/browse/KUDU-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066360#comment-17066360 ] Alexey Serbin commented on KUDU-3087: - Thank you for pinging me w.r.t. to the progress on this issue, [~huangtianhua]! I posted a patch for review: http://gerrit.cloudera.org:8080/15554 > Python tests failed on arm64 > > > Key: KUDU-3087 > URL: https://issues.apache.org/jira/browse/KUDU-3087 > Project: Kudu > Issue Type: Sub-task >Reporter: huangtianhua >Assignee: Alexey Serbin >Priority: Major > Attachments: python_test.rar > > > I took python tests for kudu on arm64 platform based on > https://gerrit.cloudera.org/#/c/14964/ the tests failed, error info as > below: > W0323 02:54:39.938022 9110 negotiation.cc:313] Failed RPC negotiation. Trace: > 0323 02:54:39.936597 (+ 0us) reactor.cc:604] Submitting negotiation task > for client connection to 127.8.25.194:34669 > 0323 02:54:39.936737 (+ 140us) negotiation.cc:98] Waiting for socket to > connect > 0323 02:54:39.936746 (+ 9us) client_negotiation.cc:169] Beginning > negotiation > 0323 02:54:39.936810 (+64us) client_negotiation.cc:246] Sending NEGOTIATE > NegotiatePB request > 0323 02:54:39.937073 (+ 263us) client_negotiation.cc:263] Received > NEGOTIATE NegotiatePB response > 0323 02:54:39.937074 (+ 1us) client_negotiation.cc:357] Received > NEGOTIATE response from server > 0323 02:54:39.937079 (+ 5us) client_negotiation.cc:184] Negotiated > authn=TOKEN > 0323 02:54:39.937168 (+89us) client_negotiation.cc:473] Sending > TLS_HANDSHAKE message to server > 0323 02:54:39.937171 (+ 3us) client_negotiation.cc:246] Sending > TLS_HANDSHAKE NegotiatePB request > 0323 02:54:39.937724 (+ 553us) client_negotiation.cc:263] Received > TLS_HANDSHAKE NegotiatePB response > 0323 02:54:39.937726 (+ 2us) client_negotiation.cc:486] Received > TLS_HANDSHAKE response from server > 0323 02:54:39.937906 (+ 180us) negotiation.cc:304] Negotiation complete: > Runtime error: Client connection negotiation failed: client connection to > 127.8.25.194:34669: TLS Handshake error: error:1416F086:SSL > routines:tls_process_server_certificate:certificate verify > failed:../ssl/statem/statem_clnt.c:1924 > Metrics: > {"client-negotiator.queue_time_us":90,"thread_start_us":41,"threads_started":1} > The python tests were successful before the commit > https://github.com/apache/kudu/commit/3343144fefaad5a30e95e21297c64c78e308fa1f > and I tried to remove this commit based on master and then the python tests > are success, seems the problem introduced by > https://github.com/apache/kudu/commit/3343144fefaad5a30e95e21297c64c78e308fa1f, > but I am sorry I can't fix this, could someone help me?Thanks. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3087) Python tests failed on arm64
[ https://issues.apache.org/jira/browse/KUDU-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065046#comment-17065046 ] Alexey Serbin commented on KUDU-3087: - [~huangtianhua] what Linux distro are you using to run those python tests? > Python tests failed on arm64 > > > Key: KUDU-3087 > URL: https://issues.apache.org/jira/browse/KUDU-3087 > Project: Kudu > Issue Type: Sub-task >Reporter: huangtianhua >Assignee: Alexey Serbin >Priority: Major > Attachments: python_test.rar > > > I took python tests for kudu on arm64 platform based on > https://gerrit.cloudera.org/#/c/14964/ the tests failed, error info as > below: > W0323 02:54:39.938022 9110 negotiation.cc:313] Failed RPC negotiation. Trace: > 0323 02:54:39.936597 (+ 0us) reactor.cc:604] Submitting negotiation task > for client connection to 127.8.25.194:34669 > 0323 02:54:39.936737 (+ 140us) negotiation.cc:98] Waiting for socket to > connect > 0323 02:54:39.936746 (+ 9us) client_negotiation.cc:169] Beginning > negotiation > 0323 02:54:39.936810 (+64us) client_negotiation.cc:246] Sending NEGOTIATE > NegotiatePB request > 0323 02:54:39.937073 (+ 263us) client_negotiation.cc:263] Received > NEGOTIATE NegotiatePB response > 0323 02:54:39.937074 (+ 1us) client_negotiation.cc:357] Received > NEGOTIATE response from server > 0323 02:54:39.937079 (+ 5us) client_negotiation.cc:184] Negotiated > authn=TOKEN > 0323 02:54:39.937168 (+89us) client_negotiation.cc:473] Sending > TLS_HANDSHAKE message to server > 0323 02:54:39.937171 (+ 3us) client_negotiation.cc:246] Sending > TLS_HANDSHAKE NegotiatePB request > 0323 02:54:39.937724 (+ 553us) client_negotiation.cc:263] Received > TLS_HANDSHAKE NegotiatePB response > 0323 02:54:39.937726 (+ 2us) client_negotiation.cc:486] Received > TLS_HANDSHAKE response from server > 0323 02:54:39.937906 (+ 180us) negotiation.cc:304] Negotiation complete: > Runtime error: Client connection negotiation failed: client connection to > 127.8.25.194:34669: TLS Handshake error: error:1416F086:SSL > routines:tls_process_server_certificate:certificate verify > failed:../ssl/statem/statem_clnt.c:1924 > Metrics: > {"client-negotiator.queue_time_us":90,"thread_start_us":41,"threads_started":1} > The python tests were successful before the commit > https://github.com/apache/kudu/commit/3343144fefaad5a30e95e21297c64c78e308fa1f > and I tried to remove this commit based on master and then the python tests > are success, seems the problem introduced by > https://github.com/apache/kudu/commit/3343144fefaad5a30e95e21297c64c78e308fa1f, > but I am sorry I can't fix this, could someone help me?Thanks. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (KUDU-3087) Python tests failed on arm64
[ https://issues.apache.org/jira/browse/KUDU-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065025#comment-17065025 ] Alexey Serbin edited comment on KUDU-3087 at 3/23/20, 7:48 PM: --- Sure, I'll take a look. I suspect 768-bit crypto is considered too weak for OpenSSL security level 1 (we set the security level to 0 in our other tests), where security level 1 is the default for OpenSSL 1.1.x on older Linux distros (for CentOS8.1 the default OpenSSL security level is 2). was (Author: aserbin): Sure, I'll take a look. I suspect 768-bit crypto is considered too weak for OpenSSL security level 1 (we set the security level to 0 in our other tests), where security level 1 is the default for OpenSSL prior to 1.1.0 version. > Python tests failed on arm64 > > > Key: KUDU-3087 > URL: https://issues.apache.org/jira/browse/KUDU-3087 > Project: Kudu > Issue Type: Sub-task >Reporter: huangtianhua >Assignee: Alexey Serbin >Priority: Major > Attachments: python_test.rar > > > I took python tests for kudu on arm64 platform based on > https://gerrit.cloudera.org/#/c/14964/ the tests failed, error info as > below: > W0323 02:54:39.938022 9110 negotiation.cc:313] Failed RPC negotiation. Trace: > 0323 02:54:39.936597 (+ 0us) reactor.cc:604] Submitting negotiation task > for client connection to 127.8.25.194:34669 > 0323 02:54:39.936737 (+ 140us) negotiation.cc:98] Waiting for socket to > connect > 0323 02:54:39.936746 (+ 9us) client_negotiation.cc:169] Beginning > negotiation > 0323 02:54:39.936810 (+64us) client_negotiation.cc:246] Sending NEGOTIATE > NegotiatePB request > 0323 02:54:39.937073 (+ 263us) client_negotiation.cc:263] Received > NEGOTIATE NegotiatePB response > 0323 02:54:39.937074 (+ 1us) client_negotiation.cc:357] Received > NEGOTIATE response from server > 0323 02:54:39.937079 (+ 5us) client_negotiation.cc:184] Negotiated > authn=TOKEN > 0323 02:54:39.937168 (+89us) client_negotiation.cc:473] Sending > TLS_HANDSHAKE message to server > 0323 02:54:39.937171 (+ 3us) client_negotiation.cc:246] Sending > TLS_HANDSHAKE NegotiatePB request > 0323 02:54:39.937724 (+ 553us) client_negotiation.cc:263] Received > TLS_HANDSHAKE NegotiatePB response > 0323 02:54:39.937726 (+ 2us) client_negotiation.cc:486] Received > TLS_HANDSHAKE response from server > 0323 02:54:39.937906 (+ 180us) negotiation.cc:304] Negotiation complete: > Runtime error: Client connection negotiation failed: client connection to > 127.8.25.194:34669: TLS Handshake error: error:1416F086:SSL > routines:tls_process_server_certificate:certificate verify > failed:../ssl/statem/statem_clnt.c:1924 > Metrics: > {"client-negotiator.queue_time_us":90,"thread_start_us":41,"threads_started":1} > The python tests were successful before the commit > https://github.com/apache/kudu/commit/3343144fefaad5a30e95e21297c64c78e308fa1f > and I tried to remove this commit based on master and then the python tests > are success, seems the problem introduced by > https://github.com/apache/kudu/commit/3343144fefaad5a30e95e21297c64c78e308fa1f, > but I am sorry I can't fix this, could someone help me?Thanks. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3087) Python tests failed on arm64
[ https://issues.apache.org/jira/browse/KUDU-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065025#comment-17065025 ] Alexey Serbin commented on KUDU-3087: - Sure, I'll take a look. I suspect 768-bit crypto is considered too weak for OpenSSL security level 1 (we set the security level to 0 in our other tests), where security level 1 is the default for OpenSSL prior to 1.1.0 version. > Python tests failed on arm64 > > > Key: KUDU-3087 > URL: https://issues.apache.org/jira/browse/KUDU-3087 > Project: Kudu > Issue Type: Sub-task >Reporter: huangtianhua >Assignee: Alexey Serbin >Priority: Major > Attachments: python_test.rar > > > I took python tests for kudu on arm64 platform based on > https://gerrit.cloudera.org/#/c/14964/ the tests failed, error info as > below: > W0323 02:54:39.938022 9110 negotiation.cc:313] Failed RPC negotiation. Trace: > 0323 02:54:39.936597 (+ 0us) reactor.cc:604] Submitting negotiation task > for client connection to 127.8.25.194:34669 > 0323 02:54:39.936737 (+ 140us) negotiation.cc:98] Waiting for socket to > connect > 0323 02:54:39.936746 (+ 9us) client_negotiation.cc:169] Beginning > negotiation > 0323 02:54:39.936810 (+64us) client_negotiation.cc:246] Sending NEGOTIATE > NegotiatePB request > 0323 02:54:39.937073 (+ 263us) client_negotiation.cc:263] Received > NEGOTIATE NegotiatePB response > 0323 02:54:39.937074 (+ 1us) client_negotiation.cc:357] Received > NEGOTIATE response from server > 0323 02:54:39.937079 (+ 5us) client_negotiation.cc:184] Negotiated > authn=TOKEN > 0323 02:54:39.937168 (+89us) client_negotiation.cc:473] Sending > TLS_HANDSHAKE message to server > 0323 02:54:39.937171 (+ 3us) client_negotiation.cc:246] Sending > TLS_HANDSHAKE NegotiatePB request > 0323 02:54:39.937724 (+ 553us) client_negotiation.cc:263] Received > TLS_HANDSHAKE NegotiatePB response > 0323 02:54:39.937726 (+ 2us) client_negotiation.cc:486] Received > TLS_HANDSHAKE response from server > 0323 02:54:39.937906 (+ 180us) negotiation.cc:304] Negotiation complete: > Runtime error: Client connection negotiation failed: client connection to > 127.8.25.194:34669: TLS Handshake error: error:1416F086:SSL > routines:tls_process_server_certificate:certificate verify > failed:../ssl/statem/statem_clnt.c:1924 > Metrics: > {"client-negotiator.queue_time_us":90,"thread_start_us":41,"threads_started":1} > The python tests were successful before the commit > https://github.com/apache/kudu/commit/3343144fefaad5a30e95e21297c64c78e308fa1f > and I tried to remove this commit based on master and then the python tests > are success, seems the problem introduced by > https://github.com/apache/kudu/commit/3343144fefaad5a30e95e21297c64c78e308fa1f, > but I am sorry I can't fix this, could someone help me?Thanks. -- This message was sent by Atlassian Jira (v8.3.4#803005)
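For context on the 'security level' remark above: since OpenSSL 1.1.0 every SSL context carries a security level, and level 1 or higher rejects weak credentials such as 768-bit RSA keys during verification, which matches the {{certificate verify failed}} error in the trace. A minimal sketch of relaxing the level for test-only contexts, assuming OpenSSL 1.1.0 or newer:
{code:cpp}
#include <openssl/ssl.h>

// Test-only: build an SSL context that accepts certificates with weak
// (e.g. 768-bit RSA) keys by dropping the security level to 0. Both
// TLS_method() and SSL_CTX_set_security_level() require OpenSSL 1.1.0+.
SSL_CTX* NewPermissiveTestContext() {
  SSL_CTX* ctx = SSL_CTX_new(TLS_method());
  if (ctx == nullptr) {
    return nullptr;
  }
  // Level 0 disables the minimum key-strength checks that level 1+
  // enforces; never do this outside of tests.
  SSL_CTX_set_security_level(ctx, 0);
  return ctx;
}
{code}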
[jira] [Created] (KUDU-3084) Multiple time sources with fallback behavior between them
Alexey Serbin created KUDU-3084: --- Summary: Multiple time sources with fallback behavior between them Key: KUDU-3084 URL: https://issues.apache.org/jira/browse/KUDU-3084 Project: Kudu Issue Type: Improvement Components: master, tserver Reporter: Alexey Serbin [~tlipcon] suggested an alternative approach to configuring and selecting HybridClock's time source. Kudu servers could maintain multiple time sources and switch between them with a fallback behavior. The default or preferred time source might be any of the existing ones (e.g., the built-in client), but when it's not available, another available time source is selected (e.g., {{system}} -- the NTP-synchronized local clock). Switching between time sources can be done: * only upon startup/initialization * upon startup/initialization and later during normal run time The advantages are: * easier deployment and configuration of Kudu clusters * a simplified upgrade path from older releases using the {{system}} time source to newer releases using the {{builtin}} time source by default There are downsides, though. Since the new way of maintaining the time source is more dynamic, it can: * mask various configuration or network issues * result in different time sources within the same Kudu cluster due to transient issues * introduce extra startup delay -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KUDU-2928) built-in NTP client: tests to evaluate the behavior of the client
[ https://issues.apache.org/jira/browse/KUDU-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin resolved KUDU-2928. - Fix Version/s: 1.12.0 Resolution: Fixed Implemented with {{4aa0c7c0bc7d91af8be9a837b64f2a53fe31dd44}} > built-in NTP client: tests to evaluate the behavior of the client > - > > Key: KUDU-2928 > URL: https://issues.apache.org/jira/browse/KUDU-2928 > Project: Kudu > Issue Type: Sub-task > Components: clock, test >Affects Versions: 1.11.0 >Reporter: Alexey Serbin >Assignee: Alexey Serbin >Priority: Major > Labels: clock > Fix For: 1.12.0 > > > It's necessary to implement tests covering the behavior of the built-in NTP > client in various corner cases: > * A set of NTP servers which doesn't agree on time > * non-synchronized NTP server > * NTP server that loses track of its reference and becomes a false ticker > * etc. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time
[ https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062763#comment-17062763 ] Alexey Serbin commented on KUDU-3082: - [~zhangyifan27], what Kudu version is that? > tablets in "CONSENSUS_MISMATCH" state for a long time > - > > Key: KUDU-3082 > URL: https://issues.apache.org/jira/browse/KUDU-3082 > Project: Kudu > Issue Type: Bug >Reporter: YifanZhang >Priority: Major > > Lately we found a few tablets in one of our clusters are unhealthy, the ksck > output is like: > > {code:java} > Tablet Summary > Tablet 7404240f458f462d92b6588d07583a52 of table '' is conflicted: 3 > replicas' active configs disagree with the leader master's > 7380d797d2ea49e88d71091802fb1c81 (kudu-ts26): RUNNING > d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING > 47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER] > All reported replicas are: > A = 7380d797d2ea49e88d71091802fb1c81 > B = d1952499f94a4e6087bee28466fcb09f > C = 47af52df1adc47e1903eb097e9c88f2e > D = 08beca5ed4d04003b6979bf8bac378d2 > The consensus matrix is: > Config source | Replicas | Current term | Config index | Committed? > ---+--+--+--+ > master| A B C* | | | Yes > A | A B C* | 5| -1 | Yes > B | A B C| 5| -1 | Yes > C | A B C* D~ | 5| 54649| No > Tablet 6d9d3fb034314fa7bee9cfbf602bcdc8 of table '' is conflicted: 2 > replicas' active configs disagree with the leader master's > d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING > 47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER] > 5a8aeadabdd140c29a09dabcae919b31 (kudu-ts21): RUNNING > All reported replicas are: > A = d1952499f94a4e6087bee28466fcb09f > B = 47af52df1adc47e1903eb097e9c88f2e > C = 5a8aeadabdd140c29a09dabcae919b31 > D = 14632cdbb0d04279bc772f64e06389f9 > The consensus matrix is: > Config source | Replicas | Current term | Config index | Committed? > ---+--+--+--+ > master| A B* C| | | Yes > A | A B* C| 5| 5| Yes > B | A B* C D~ | 5| 96176| No > C | A B* C| 5| 5| Yes > Tablet bf1ec7d693b94632b099dc0928e76363 of table '' is conflicted: 1 > replicas' active configs disagree with the leader master's > a9eaff3cf1ed483aae84954d649a (kudu-ts23): RUNNING > f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING > 47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER] > All reported replicas are: > A = a9eaff3cf1ed483aae84954d649a > B = f75df4a6b5ce404884313af5f906b392 > C = 47af52df1adc47e1903eb097e9c88f2e > D = d1952499f94a4e6087bee28466fcb09f > The consensus matrix is: > Config source | Replicas | Current term | Config index | Committed? > ---+--+--+--+ > master| A B C* | | | Yes > A | A B C* | 1| -1 | Yes > B | A B C* | 1| -1 | Yes > C | A B C* D~ | 1| 2| No > Tablet 3190a310857e4c64997adb477131488a of table '' is conflicted: 3 > replicas' active configs disagree with the leader master's > 47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER] > f0f7b2f4b9d344e6929105f48365f38e (kudu-ts24): RUNNING > f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING > All reported replicas are: > A = 47af52df1adc47e1903eb097e9c88f2e > B = f0f7b2f4b9d344e6929105f48365f38e > C = f75df4a6b5ce404884313af5f906b392 > D = d1952499f94a4e6087bee28466fcb09f > The consensus matrix is: > Config source | Replicas | Current term | Config index | Committed? > ---+--+--+--+ > master| A* B C| | | Yes > A | A* B C D~ | 1| 1991 | No > B | A* B C| 1| 4| Yes > C | A* B C| 1| 4| Yes{code} > These tablets couldn't recover for a couple of days until we restart > kudu-ts27. 
> I found so many duplicated logs in kudu-ts27 are like: > {code:java} > I0314 04:38:41.511279 65731 raft_consensus.cc:937] T > 7404240f458f462d92b6588d07583a52 P 47af52df1adc47e1903eb097e9c88f2e [term 3 > LEAD
[jira] [Updated] (KUDU-3067) Inexplict cloud detection for AWS and OpenStack based cloud by querying metadata
[ https://issues.apache.org/jira/browse/KUDU-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3067: Fix Version/s: 1.12.0 Resolution: Fixed Status: Resolved (was: In Review) > Inexplict cloud detection for AWS and OpenStack based cloud by querying > metadata > > > Key: KUDU-3067 > URL: https://issues.apache.org/jira/browse/KUDU-3067 > Project: Kudu > Issue Type: Bug >Reporter: liusheng >Assignee: Alexey Serbin >Priority: Major > Fix For: 1.12.0 > > > The cloud detector is used to check the cloud provider of the instance, see > [here|#L59-L93]], For AWS cloud it using the URL > [http://169.254.169.254/latest/meta-data/instance-id|http://169.254.169.254/latest/meta-data/instance-id*] > to check the specific metadata to determine it is AWS instance. This is OK, > but for OpenStack based cloud, the metadata is same with AWS, so this URL can > also be accessed. So this cannot distinct the AWS and other OpenStack based > clouds. This caused an issue when run > "HybridClockTest.TimeSourceAutoSelection" test case, this test will use the > above URL to detect the Cloud of instance current running on and then try to > call the NTP service, for AWS, the dedicated NTP service is > "169.254.169.123", but for OpenStack based cloud, there isn't such a > dedicated NTP service. So this test case will fail if I run on a instance of > OpenStack based cloud because the cloud detector suppose it is AWS instance > and try to access "169.254.169.123". > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3067) Inexplict cloud detection for AWS and OpenStack based cloud by querying metadata
[ https://issues.apache.org/jira/browse/KUDU-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3067: Code Review: http://gerrit.cloudera.org:8080/15488
[~seanlau], could you verify that the fix published at http://gerrit.cloudera.org:8080/15488 works as expected?
> Inexplicit cloud detection for AWS and OpenStack-based clouds by querying metadata
> ----------------------------------------------------------------------------------
>
> Key: KUDU-3067
> URL: https://issues.apache.org/jira/browse/KUDU-3067
> Project: Kudu
> Issue Type: Bug
> Reporter: liusheng
> Assignee: Alexey Serbin
> Priority: Major
>
> The cloud detector checks which cloud provider the instance is running on (see [here|#L59-L93]). For AWS it queries the URL http://169.254.169.254/latest/meta-data/instance-id and uses the returned metadata to conclude that the instance is an AWS instance. That works for AWS itself, but OpenStack-based clouds serve the same EC2-compatible metadata, so the same URL is reachable there as well and the detector cannot distinguish AWS from OpenStack-based clouds. This caused a failure in the HybridClockTest.TimeSourceAutoSelection test: the test uses the URL above to detect which cloud the instance is running on and then tries to reach the cloud's dedicated NTP service. For AWS that service is at 169.254.169.123, but OpenStack-based clouds have no such dedicated NTP service, so the test fails on an OpenStack-based instance because the cloud detector assumes it is an AWS instance and tries to reach 169.254.169.123.
--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3067) Inexplicit cloud detection for AWS and OpenStack-based clouds by querying metadata
[ https://issues.apache.org/jira/browse/KUDU-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3067: Status: In Review (was: In Progress)
> Inexplicit cloud detection for AWS and OpenStack-based clouds by querying metadata
> ----------------------------------------------------------------------------------
>
> Key: KUDU-3067
> URL: https://issues.apache.org/jira/browse/KUDU-3067
> Project: Kudu
> Issue Type: Bug
> Reporter: liusheng
> Assignee: Alexey Serbin
> Priority: Major
>
> The cloud detector checks which cloud provider the instance is running on (see [here|#L59-L93]). For AWS it queries the URL http://169.254.169.254/latest/meta-data/instance-id and uses the returned metadata to conclude that the instance is an AWS instance. That works for AWS itself, but OpenStack-based clouds serve the same EC2-compatible metadata, so the same URL is reachable there as well and the detector cannot distinguish AWS from OpenStack-based clouds. This caused a failure in the HybridClockTest.TimeSourceAutoSelection test: the test uses the URL above to detect which cloud the instance is running on and then tries to reach the cloud's dedicated NTP service. For AWS that service is at 169.254.169.123, but OpenStack-based clouds have no such dedicated NTP service, so the test fails on an OpenStack-based instance because the cloud detector assumes it is an AWS instance and tries to reach 169.254.169.123.
--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3067) Inexplicit cloud detection for AWS and OpenStack-based clouds by querying metadata
[ https://issues.apache.org/jira/browse/KUDU-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061907#comment-17061907 ] Alexey Serbin commented on KUDU-3067: - [~seanlau], the main thing blocking me here is access to an OpenStack cloud instance. If I put together a WIP patch for review at gerrit.cloudera.org, would you be able to verify that it works for you? Also, if you could run the following curl command on one of your instances and post back the output, that would be great: {{curl -v http://169.254.169.254/latest/meta-data/instance-id}} Also, do you have an account in the #kudu-general Slack channel? Maybe we can sync up over Slack on that. Thank you!
> Inexplicit cloud detection for AWS and OpenStack-based clouds by querying metadata
> ----------------------------------------------------------------------------------
>
> Key: KUDU-3067
> URL: https://issues.apache.org/jira/browse/KUDU-3067
> Project: Kudu
> Issue Type: Bug
> Reporter: liusheng
> Assignee: Alexey Serbin
> Priority: Major
>
> The cloud detector checks which cloud provider the instance is running on (see [here|#L59-L93]). For AWS it queries the URL http://169.254.169.254/latest/meta-data/instance-id and uses the returned metadata to conclude that the instance is an AWS instance. That works for AWS itself, but OpenStack-based clouds serve the same EC2-compatible metadata, so the same URL is reachable there as well and the detector cannot distinguish AWS from OpenStack-based clouds. This caused a failure in the HybridClockTest.TimeSourceAutoSelection test: the test uses the URL above to detect which cloud the instance is running on and then tries to reach the cloud's dedicated NTP service. For AWS that service is at 169.254.169.123, but OpenStack-based clouds have no such dedicated NTP service, so the test fails on an OpenStack-based instance because the cloud detector assumes it is an AWS instance and tries to reach 169.254.169.123.
--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3067) Inexplicit cloud detection for AWS and OpenStack-based clouds by querying metadata
[ https://issues.apache.org/jira/browse/KUDU-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3067: Status: Open (was: In Review)
> Inexplicit cloud detection for AWS and OpenStack-based clouds by querying metadata
> ----------------------------------------------------------------------------------
>
> Key: KUDU-3067
> URL: https://issues.apache.org/jira/browse/KUDU-3067
> Project: Kudu
> Issue Type: Bug
> Reporter: liusheng
> Assignee: Alexey Serbin
> Priority: Major
>
> The cloud detector checks which cloud provider the instance is running on (see [here|#L59-L93]). For AWS it queries the URL http://169.254.169.254/latest/meta-data/instance-id and uses the returned metadata to conclude that the instance is an AWS instance. That works for AWS itself, but OpenStack-based clouds serve the same EC2-compatible metadata, so the same URL is reachable there as well and the detector cannot distinguish AWS from OpenStack-based clouds. This caused a failure in the HybridClockTest.TimeSourceAutoSelection test: the test uses the URL above to detect which cloud the instance is running on and then tries to reach the cloud's dedicated NTP service. For AWS that service is at 169.254.169.123, but OpenStack-based clouds have no such dedicated NTP service, so the test fails on an OpenStack-based instance because the cloud detector assumes it is an AWS instance and tries to reach 169.254.169.123.
--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3067) Inexplicit cloud detection for AWS and OpenStack-based clouds by querying metadata
[ https://issues.apache.org/jira/browse/KUDU-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061869#comment-17061869 ] Alexey Serbin commented on KUDU-3067: - Hi [~seanlau], Sure -- looking.
> Inexplicit cloud detection for AWS and OpenStack-based clouds by querying metadata
> ----------------------------------------------------------------------------------
>
> Key: KUDU-3067
> URL: https://issues.apache.org/jira/browse/KUDU-3067
> Project: Kudu
> Issue Type: Bug
> Reporter: liusheng
> Assignee: Alexey Serbin
> Priority: Major
>
> The cloud detector checks which cloud provider the instance is running on (see [here|#L59-L93]). For AWS it queries the URL http://169.254.169.254/latest/meta-data/instance-id and uses the returned metadata to conclude that the instance is an AWS instance. That works for AWS itself, but OpenStack-based clouds serve the same EC2-compatible metadata, so the same URL is reachable there as well and the detector cannot distinguish AWS from OpenStack-based clouds. This caused a failure in the HybridClockTest.TimeSourceAutoSelection test: the test uses the URL above to detect which cloud the instance is running on and then tries to reach the cloud's dedicated NTP service. For AWS that service is at 169.254.169.123, but OpenStack-based clouds have no such dedicated NTP service, so the test fails on an OpenStack-based instance because the cloud detector assumes it is an AWS instance and tries to reach 169.254.169.123.
--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3067) Inexplicit cloud detection for AWS and OpenStack-based clouds by querying metadata
[ https://issues.apache.org/jira/browse/KUDU-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3067: Status: In Review (was: Open)
> Inexplicit cloud detection for AWS and OpenStack-based clouds by querying metadata
> ----------------------------------------------------------------------------------
>
> Key: KUDU-3067
> URL: https://issues.apache.org/jira/browse/KUDU-3067
> Project: Kudu
> Issue Type: Bug
> Reporter: liusheng
> Assignee: Alexey Serbin
> Priority: Major
>
> The cloud detector checks which cloud provider the instance is running on (see [here|#L59-L93]). For AWS it queries the URL http://169.254.169.254/latest/meta-data/instance-id and uses the returned metadata to conclude that the instance is an AWS instance. That works for AWS itself, but OpenStack-based clouds serve the same EC2-compatible metadata, so the same URL is reachable there as well and the detector cannot distinguish AWS from OpenStack-based clouds. This caused a failure in the HybridClockTest.TimeSourceAutoSelection test: the test uses the URL above to detect which cloud the instance is running on and then tries to reach the cloud's dedicated NTP service. For AWS that service is at 169.254.169.123, but OpenStack-based clouds have no such dedicated NTP service, so the test fails on an OpenStack-based instance because the cloud detector assumes it is an AWS instance and tries to reach 169.254.169.123.
--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KUDU-3075) SubprocessServerTest.TestTimeoutWhileQueueingCalls sometimes fails
Alexey Serbin created KUDU-3075: --- Summary: SubprocessServerTest.TestTimeoutWhileQueueingCalls sometimes fails Key: KUDU-3075 URL: https://issues.apache.org/jira/browse/KUDU-3075 Project: Kudu Issue Type: Bug Components: test Affects Versions: 1.12.0 Reporter: Alexey Serbin Attachments: subprocess_server-test.txt.xz
The test scenario sometimes fails like below (at least in TSAN builds):
{noformat}
W0314 03:52:48.979025  3014 server.h:119] failed to send request: End of file: unable to send message: Other end of pipe was closed
src/kudu/subprocess/subprocess_server-test.cc:233: Failure
Value of: has_timeout_when_queueing
  Actual: false
Expected: true
expected at least one timeout
{noformat}
The log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KUDU-3073) BuiltinNtpWithMiniChronydTest.SyncAndUnsyncReferenceServers sometimes fails
Alexey Serbin created KUDU-3073: --- Summary: BuiltinNtpWithMiniChronydTest.SyncAndUnsyncReferenceServers sometimes fails Key: KUDU-3073 URL: https://issues.apache.org/jira/browse/KUDU-3073 Project: Kudu Issue Type: Bug Affects Versions: 1.12.0 Reporter: Alexey Serbin Attachments: ntp-test.txt.xz
{noformat}
src/kudu/clock/ntp-test.cc:478: Failure
Value of: s.IsRuntimeError()
  Actual: false
Expected: true
OK
src/kudu/clock/ntp-test.cc:595: Failure
Expected: CheckNoNtpSource(sync_servers_refs) doesn't generate new fatal failures in the current thread.
  Actual: it does.
{noformat}
The log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KUDU-2780) Rebalance Kudu cluster in background
[ https://issues.apache.org/jira/browse/KUDU-2780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin reassigned KUDU-2780: --- Code Review: https://gerrit.cloudera.org/#/c/14177/ Component/s: master Target Version/s: 1.12.0 Assignee: Hannah Nguyen (was: Alexey Serbin)
> Rebalance Kudu cluster in background
> ------------------------------------
>
> Key: KUDU-2780
> URL: https://issues.apache.org/jira/browse/KUDU-2780
> Project: Kudu
> Issue Type: Improvement
> Components: master
> Reporter: Alexey Serbin
> Assignee: Hannah Nguyen
> Priority: Major
> Labels: roadmap-candidate
>
> With the introduction of the `kudu cluster rebalance` CLI tool, it's possible to balance the distribution of tablet replicas in a Kudu cluster. However, that tool has to be run manually or via an external scheduler (e.g., cron); an example crontab entry is sketched below.
> It would be nice if Kudu tracked and corrected imbalances in replica distribution automatically.
--
This message was sent by Atlassian Jira (v8.3.4#803005)
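As an illustration of the current manual approach (the schedule, log path, and master addresses below are placeholders, not project recommendations; `kudu cluster rebalance` itself is the real CLI), a crontab entry driving the rebalancer periodically might look like:

{noformat}
# Rebalance every 6 hours; master-N:7051 are placeholder master RPC addresses.
0 */6 * * *  /usr/bin/kudu cluster rebalance master-1:7051,master-2:7051,master-3:7051 >> /var/log/kudu/rebalance.log 2>&1
{noformat}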
[jira] [Assigned] (KUDU-2780) Rebalance Kudu cluster in background
[ https://issues.apache.org/jira/browse/KUDU-2780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin reassigned KUDU-2780: --- Assignee: Alexey Serbin (was: Hannah Nguyen)
> Rebalance Kudu cluster in background
> ------------------------------------
>
> Key: KUDU-2780
> URL: https://issues.apache.org/jira/browse/KUDU-2780
> Project: Kudu
> Issue Type: Improvement
> Reporter: Alexey Serbin
> Assignee: Alexey Serbin
> Priority: Major
> Labels: roadmap-candidate
>
> With the introduction of the `kudu cluster rebalance` CLI tool, it's possible to balance the distribution of tablet replicas in a Kudu cluster. However, that tool has to be run manually or via an external scheduler (e.g., cron).
> It would be nice if Kudu tracked and corrected imbalances in replica distribution automatically.
--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KUDU-3067) Inexplicit cloud detection for AWS and OpenStack-based clouds by querying metadata
[ https://issues.apache.org/jira/browse/KUDU-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin reassigned KUDU-3067: --- Assignee: Alexey Serbin
> Inexplicit cloud detection for AWS and OpenStack-based clouds by querying metadata
> ----------------------------------------------------------------------------------
>
> Key: KUDU-3067
> URL: https://issues.apache.org/jira/browse/KUDU-3067
> Project: Kudu
> Issue Type: Bug
> Reporter: liusheng
> Assignee: Alexey Serbin
> Priority: Major
>
> The cloud detector checks which cloud provider the instance is running on (see [here|#L59-L93]). For AWS it queries the URL http://169.254.169.254/latest/meta-data/instance-id and uses the returned metadata to conclude that the instance is an AWS instance. That works for AWS itself, but OpenStack-based clouds serve the same EC2-compatible metadata, so the same URL is reachable there as well and the detector cannot distinguish AWS from OpenStack-based clouds. This caused a failure in the HybridClockTest.TimeSourceAutoSelection test: the test uses the URL above to detect which cloud the instance is running on and then tries to reach the cloud's dedicated NTP service. For AWS that service is at 169.254.169.123, but OpenStack-based clouds have no such dedicated NTP service, so the test fails on an OpenStack-based instance because the cloud detector assumes it is an AWS instance and tries to reach 169.254.169.123.
--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KUDU-3065) MasterReplicationAndRpcSizeLimitTest.TabletReports sometimes fails
Alexey Serbin created KUDU-3065: --- Summary: MasterReplicationAndRpcSizeLimitTest.TabletReports sometimes fails Key: KUDU-3065 URL: https://issues.apache.org/jira/browse/KUDU-3065 Project: Kudu Issue Type: Bug Components: test Affects Versions: 1.12.0 Reporter: Alexey Serbin Attachments: master_replication-itest.txt.xz
The {{MasterReplicationAndRpcSizeLimitTest.TabletReports}} scenario sometimes fails with an error like below:
{noformat}
F0228 19:57:29.566195   236 test_workload.cc:330] Timed out: Timed out waiting for Table Creation
*** Check failure stack trace: ***
*** Aborted at 1582919849 (unix time) try "date -d @1582919849" if you are using GNU date ***
PC: @ 0x7f62a6483c37 gsignal
*** SIGABRT (@0x3e800ec) received by PID 236 (TID 0x7f62ab9ff8c0) from PID 236; stack trace: ***
    @ 0x7f62a94ca330 (unknown) at ??:0
    @ 0x7f62a6483c37 gsignal at ??:0
    @ 0x7f62a6487028 abort at ??:0
    @ 0x7f62a74f2e09 google::logging_fail() at ??:0
    @ 0x7f62a74f462d google::LogMessage::Fail() at ??:0
    @ 0x7f62a74f664c google::LogMessage::SendToLog() at ??:0
    @ 0x7f62a74f4189 google::LogMessage::Flush() at ??:0
    @ 0x7f62a74f6fdf google::LogMessageFatal::~LogMessageFatal() at ??:0
    @ 0x7f62ab5bf008 kudu::TestWorkload::Setup() at ??:0
    @ 0x431230 kudu::master::MasterReplicationAndRpcSizeLimitTest_TabletReports_Test::TestBody() at src/kudu/integration-tests/master_replication-itest.cc:597
    @ 0x7f62a851ab89 testing::internal::HandleExceptionsInMethodIfSupported<>() at ??:0
    @ 0x7f62a850b68f testing::Test::Run() at ??:0
    @ 0x7f62a850b74d testing::TestInfo::Run() at ??:0
    @ 0x7f62a850b865 testing::TestCase::Run() at ??:0
    @ 0x7f62a850bb28 testing::internal::UnitTestImpl::RunAllTests() at ??:0
    @ 0x7f62a850bdc9 testing::UnitTest::Run() at ??:0
    @ 0x7f62ab297cf3 RUN_ALL_TESTS() at ??:0
    @ 0x7f62ab295cab main at ??:0
    @ 0x7f62a646ef45 __libc_start_main at ??:0
    @ 0x42a999 (unknown) at ??:?
W0228 19:57:29.694170  6801 catalog_manager.cc:4593] T  P d48d9641d5694fa783f13d0dce1ad5a9: Tablet 0113c6c5ccda4639a20e6cda5ce67acf (table test-workload [id=1f40b8e5e9454964ad4380766e8e2382]) was not created within the allowed timeout. Replacing with a new tablet 85e9b58f383a48cd807f7a49c6190867
{noformat}
The log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-2573) Fully support Chrony in place of NTP
[ https://issues.apache.org/jira/browse/KUDU-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17048012#comment-17048012 ] Alexey Serbin commented on KUDU-2573: - After running several Kudu clusters on machines whose local clock is synchronized with {{chronyd}}, I think we can declare it safe to use {{chrony}} version 3.4 and newer instead of {{ntp}} on Kudu nodes. The important configuration option to turn on is {{rtcsync}}, as described [here|https://issues.apache.org/jira/browse/KUDU-2573?focusedCommentId=17029145&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17029145]. I updated the NTP troubleshooting docs: https://gerrit.cloudera.org/#/c/15320/ The rest of the docs (e.g., building Kudu) still need to be updated before changing the resolution of this JIRA item.
> Fully support Chrony in place of NTP
> ------------------------------------
>
> Key: KUDU-2573
> URL: https://issues.apache.org/jira/browse/KUDU-2573
> Project: Kudu
> Issue Type: New Feature
> Components: clock, master, tserver
> Reporter: Grant Henke
> Assignee: Alexey Serbin
> Priority: Major
> Labels: clock
>
> This is to track fully supporting Chrony in place of NTP. Given that Chrony is the default in RHEL 7+, running Kudu with Chrony is likely to be more common.
> The work should entail:
> * identifying and fixing or documenting any differences or gaps
> * removing the experimental warnings from the documentation
--
This message was sent by Atlassian Jira (v8.3.4#803005)
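For reference, a minimal {{chrony.conf}} sketch with {{rtcsync}} enabled (the NTP source line is a placeholder; the {{rtcsync}} directive is the important part here):

{noformat}
# /etc/chrony.conf -- minimal illustration, not a recommended production config
pool pool.ntp.org iburst   # placeholder NTP source(s)
makestep 1.0 3             # allow stepping the clock during the first few updates
rtcsync                    # let the kernel keep the RTC in sync with the system clock
{noformat}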
[jira] [Updated] (KUDU-2322) Leader spews logs when follower falls behind log GC
[ https://issues.apache.org/jira/browse/KUDU-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-2322: Fix Version/s: (was: 1.7.0) 1.7.1 > Leader spews logs when follower falls behind log GC > --- > > Key: KUDU-2322 > URL: https://issues.apache.org/jira/browse/KUDU-2322 > Project: Kudu > Issue Type: Bug > Components: consensus >Affects Versions: 1.7.0 >Reporter: Todd Lipcon >Assignee: Alexey Serbin >Priority: Critical > Fix For: 1.8.0, 1.7.1 > > > I'm running a YCSB-based write stress test and found that one of the > followers fell behind enough that its logs got GCed by the leader. At this > point, the leader started logging about 100 messages per second indicating > that it could not obtain a request for this peer. > I believe this is a regression since 1.6, since before 3-4-3 replication we > would have evicted the replica as soon as it fell behind GC. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-2342) Non-voter replicas can be promoted and get stuck
[ https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-2342: Fix Version/s: 1.8.0
> Non-voter replicas can be promoted and get stuck
> ------------------------------------------------
>
> Key: KUDU-2342
> URL: https://issues.apache.org/jira/browse/KUDU-2342
> Project: Kudu
> Issue Type: Bug
> Components: tablet
> Affects Versions: 1.7.0
> Reporter: Mostafa Mokhtar
> Assignee: Alexey Serbin
> Priority: Blocker
> Labels: scalability
> Fix For: 1.8.0, 1.7.1
>
> Attachments: Impala query profile.txt, tablet-info.html
>
> While loading 30 TB of TPC-H data on a 129-node cluster via Impala, a write operation failed with:
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 180.000s (SENT)
--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-2342) Non-voter replicas can be promoted and get stuck
[ https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-2342: Fix Version/s: (was: 1.7.0) 1.7.1
> Non-voter replicas can be promoted and get stuck
> ------------------------------------------------
>
> Key: KUDU-2342
> URL: https://issues.apache.org/jira/browse/KUDU-2342
> Project: Kudu
> Issue Type: Bug
> Components: tablet
> Affects Versions: 1.7.0
> Reporter: Mostafa Mokhtar
> Assignee: Alexey Serbin
> Priority: Blocker
> Labels: scalability
> Fix For: 1.7.1
>
> Attachments: Impala query profile.txt, tablet-info.html
>
> While loading 30 TB of TPC-H data on a 129-node cluster via Impala, a write operation failed with:
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 180.000s (SENT)
--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KUDU-3062) built-in NTP client: create a mock NTP server
Alexey Serbin created KUDU-3062: --- Summary: built-in NTP client: create a mock NTP server Key: KUDU-3062 URL: https://issues.apache.org/jira/browse/KUDU-3062 Project: Kudu Issue Type: Sub-task Reporter: Alexey Serbin To test the functionality of NTP packet sanitisation performed by the built-in NTP client, it's necessary to create a mock NTP server to simulate various corner cases. The motivation for this: running tests against chronyd as a test NTP server is great, but we need to make sure the client works with other NTP servers which are available in the wild. For more context, see https://gerrit.cloudera.org/#/c/15274/ -- This message was sent by Atlassian Jira (v8.3.4#803005)
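A minimal sketch of what such a mock could look like (an illustration only, not the mock server this JIRA tracks): a UDP server that answers every query with a deliberately invalid reply -- leap indicator 3 ("clock unsynchronized") and stratum 0 -- which a client with proper packet sanitisation must reject.

{code:cpp}
// mock_ntp.cc -- illustrative sketch of a mock NTP server for client testing.
// Build with: g++ mock_ntp.cc
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
  int fd = socket(AF_INET, SOCK_DGRAM, 0);
  if (fd < 0) {
    perror("socket");
    return 1;
  }

  sockaddr_in addr;
  std::memset(&addr, 0, sizeof(addr));
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = htons(12345);  // unprivileged test port instead of 123
  if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
    perror("bind");
    return 1;
  }

  uint8_t pkt[48];  // an NTPv4 packet without extension fields is 48 bytes
  while (true) {
    sockaddr_in peer;
    socklen_t peer_len = sizeof(peer);
    ssize_t n = recvfrom(fd, pkt, sizeof(pkt), 0,
                         reinterpret_cast<sockaddr*>(&peer), &peer_len);
    if (n < 48) continue;  // ignore short datagrams

    // Turn the request into a corner-case reply in place:
    // LI=3 (alarm: clock not synchronized), VN=4, Mode=4 (server), stratum 0.
    pkt[0] = (3u << 6) | (4u << 3) | 4u;
    pkt[1] = 0;
    // Timestamps are left untouched; a sanitising client should already have
    // rejected the packet based on the LI/stratum fields above.
    sendto(fd, pkt, sizeof(pkt), 0,
           reinterpret_cast<sockaddr*>(&peer), peer_len);
  }
}
{code}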
[jira] [Assigned] (KUDU-3053) Make 'kudu cluster ksck' check and report on the time source settings in a Kudu cluster
[ https://issues.apache.org/jira/browse/KUDU-3053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin reassigned KUDU-3053: --- Assignee: Alexey Serbin > Make 'kudu cluster ksck' check and report on the time source settings in a > Kudu cluster > --- > > Key: KUDU-3053 > URL: https://issues.apache.org/jira/browse/KUDU-3053 > Project: Kudu > Issue Type: Improvement > Components: CLI >Reporter: Alexey Serbin >Assignee: Alexey Serbin >Priority: Major > Labels: clock > > Since the time source for the hybrid clock is now configurable (and even > auto-configurable), it's important to ensure that the time source is > configured consistently across the cluster. At least, there should be a way > to spot discrepancies in the time source across all Kudu servers > (masters/tservers). > Let's add extra functionality into the {{kudu cluster ksck}} logic to verify > that the time source is configured uniformly across Kudu cluster. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KUDU-3048) Add time/clock synchronization metrics
[ https://issues.apache.org/jira/browse/KUDU-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin resolved KUDU-3048. --- Fix Version/s: 1.12.0 Resolution: Fixed
> Add time/clock synchronization metrics
> --------------------------------------
>
> Key: KUDU-3048
> URL: https://issues.apache.org/jira/browse/KUDU-3048
> Project: Kudu
> Issue Type: Improvement
> Components: clock, master, tserver
> Reporter: Alexey Serbin
> Assignee: Alexey Serbin
> Priority: Major
> Labels: clock
> Fix For: 1.12.0
>
> For better visibility, it would be great to add metrics reflecting time/clock synchronization parameters:
> * the stats on the max_error sampled while reading the underlying clock
> * the stats on time intervals when the underlying clock was extrapolated instead of using the actual readings: the number of such intervals and stats on the interval duration
> * whether hybrid clock timestamps are generated using interpolated clock readings instead of real ones
> * if using the {{built-in}} time source:
> ** difference between the tracked true time and the local wallclock
> ** the most recently computed true time
> ** the stats on the maximum error of the computed true time
> As for the rationale behind the new metrics:
> * max_error shows how far the clock is from the true time; maybe it's time to use another set of NTP servers or to increase the {{\-\-max_clock_sync_error_usec}} flag value
> * the presence of extrapolation intervals for the hybrid clock signals periods of NTP server non-availability, and a possible action would be re-visiting the set of NTP servers
> * if hybrid timestamps are being extrapolated for some time, Kudu masters and tablet servers might crash if the clock error eventually goes beyond the configured threshold: it's time to start troubleshooting the issue to avoid possible non-availability of the cluster
> * the delta between the true time tracked by the built-in NTP client and the local system clock is useful to understand how the log timestamps relate to the HybridClock timestamps (when using the built-in NTP client those might diverge)
> * the stats on the true time computed by the built-in NTP client give insights into the quality of the reference NTP servers
> The new metrics can be used for monitoring and alerting, allowing for pro-active maintenance of a Kudu cluster.
--
This message was sent by Atlassian Jira (v8.3.4#803005)
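Once such metrics are exposed, they can be scraped from a server's embedded web server like any other Kudu metric; for example (the host name and the metric-name filter below are illustrative):

{noformat}
# 8050 is the default tablet server web UI port; ?metrics= filters by name substring.
curl -s 'http://tserver-1.example.com:8050/metrics?metrics=clock'
{noformat}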
[jira] [Updated] (KUDU-3048) Add time/clock synchronization metrics
[ https://issues.apache.org/jira/browse/KUDU-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3048:
Description:
For better visibility, it would be great to add metrics reflecting time/clock synchronization parameters:
* the stats on the max_error sampled while reading the underlying clock
* the stats on time intervals when the underlying clock was extrapolated instead of using the actual readings: the number of such intervals and stats on the interval duration
* whether hybrid clock timestamps are generated using interpolated clock readings instead of real ones
* if using the {{built-in}} time source:
** difference between the tracked true time and the local wallclock
** the most recently computed true time
** the stats on the maximum error of the computed true time
As for the rationale behind the new metrics:
* max_error shows how far the clock is from the true time; maybe it's time to use another set of NTP servers or to increase the {{\-\-max_clock_sync_error_usec}} flag value
* the presence of extrapolation intervals for the hybrid clock signals periods of NTP server non-availability, and a possible action would be re-visiting the set of NTP servers
* if hybrid timestamps are being extrapolated for some time, Kudu masters and tablet servers might crash if the clock error eventually goes beyond the configured threshold: it's time to start troubleshooting the issue to avoid possible non-availability of the cluster
* the delta between the true time tracked by the built-in NTP client and the local system clock is useful to understand how the log timestamps relate to the HybridClock timestamps (when using the built-in NTP client those might diverge)
* the stats on the true time computed by the built-in NTP client give insights into the quality of the reference NTP servers
The new metrics can be used for monitoring and alerting, allowing for pro-active maintenance of a Kudu cluster.

was:
For better visibility, it would be great to add metrics reflecting time/clock synchronization parameters:
* the stats on the max_error sampled while reading the underlying clock
* the stats on time intervals when the underlying clock was extrapolated instead of using the actual readings: the number of such intervals and stats on the interval duration
* whether hybrid clock timestamps are generated using interpolated clock readings instead of real ones
* if using the {{built-in}} time source:
** difference between the tracked true time and the local wallclock
** the most recently computed true time
** the stats on the maximum error of the computed true time
As for the rationale behind the new metrics:
* max_error shows how far the clock is from the true time; maybe it's time to use another set of NTP servers or to increase the {{\-\-max_clock_sync_error_usec}} flag value
* the presence of extrapolation intervals for the hybrid clock signals periods of NTP server non-availability, and a possible action would be re-visiting the set of NTP servers
* if hybrid timestamps are being extrapolated for some time, Kudu masters and tablet servers might crash if the clock error eventually goes beyond the configured threshold: it's time to start troubleshooting the issue to avoid possible non-availability of the cluster
The new metrics can be used for monitoring and alerting, allowing for pro-active maintenance of a Kudu cluster.

> Add time/clock synchronization metrics
> --------------------------------------
>
> Key: KUDU-3048
> URL: https://issues.apache.org/jira/browse/KUDU-3048
> Project: Kudu
> Issue Type: Improvement
> Components: clock, master, tserver
> Reporter: Alexey Serbin
> Assignee: Alexey Serbin
> Priority: Major
> Labels: clock
>
> For better visibility, it would be great to add metrics reflecting time/clock synchronization parameters:
> * the stats on the max_error sampled while reading the underlying clock
> * the stats on time intervals when the underlying clock was extrapolated instead of using the actual readings: the number of such intervals and stats on the interval duration
> * whether hybrid clock timestamps are generated using interpolated clock readings instead of real ones
> * if using the {{built-in}} time source:
> ** difference between the tracked true time and the local wallclock
> ** the most recently computed true time
> ** the stats on the maximum error of the computed true time
> As for the rationale behind the new metrics:
> * max_error shows how far the clock is from the true time; maybe it's time to use another set of NTP servers or to increase the {{\-\-max_clock_sync_error_usec}} flag value
> * the presence of extrapolation intervals for the hybrid clock signals periods of NTP server non-availability, and a possible action would be
[jira] [Commented] (KUDU-3058) RollingRestartITest.TestWorkloads sometimes fails
[ https://issues.apache.org/jira/browse/KUDU-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041543#comment-17041543 ] Alexey Serbin commented on KUDU-3058: - Attaching one more failure log, from an ASAN pre-commit run. [^maintenance_mode-itest.00.txt.xz]
> RollingRestartITest.TestWorkloads sometimes fails
> -------------------------------------------------
>
> Key: KUDU-3058
> URL: https://issues.apache.org/jira/browse/KUDU-3058
> Project: Kudu
> Issue Type: Bug
> Components: test
> Affects Versions: 1.12.0
> Reporter: Alexey Serbin
> Priority: Minor
> Attachments: maintenance_mode-itest.0.txt, maintenance_mode-itest.0.txt.xz, maintenance_mode-itest.00.txt.xz
>
> The scenario sometimes fails with an error like below:
> {noformat}
> 0219 20:46:16.414293 (+11us) service_pool.cc:221] Handling call
> 0219 20:46:18.247943 (+1833650us) inbound_call.cc:162] Queueing success response
> Metrics: {}
> I0219 20:46:19.529562 31739 ts_manager.cc:267] Unset tserver state for f0df965344df403a86c95138c5e0f771 from MAINTENANCE_MODE
> I0219 20:46:19.535373 31739 ts_manager.cc:267] Unset tserver state for c63029ba5b4148ab90e9d437e9487c76 from MAINTENANCE_MODE
> I0219 20:46:19.538385 31739 ts_manager.cc:267] Unset tserver state for 4216503399e4476694010b88d3ab8cc5 from MAINTENANCE_MODE
> I0219 20:46:19.542889 31739 ts_manager.cc:267] Unset tserver state for 5080d97cb158428d8f86ab6797dd8149 from MAINTENANCE_MODE
> /data0/jenkins/workspace/kudu-pre-commit-unittest-RELEASE/src/kudu/integration-tests/maintenance_mode-itest.cc:750: Failure
> Value of: s.ok()
>   Actual: true
> Expected: false
> /data0/jenkins/workspace/kudu-pre-commit-unittest-RELEASE/src/kudu/util/test_util.cc:345: Failure
> Failed
> Timed out waiting for assertion to pass.
> {noformat}
> The log is attached.
--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3058) RollingRestartITest.TestWorkloads sometimes fails
[ https://issues.apache.org/jira/browse/KUDU-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3058: Attachment: maintenance_mode-itest.00.txt.xz
> RollingRestartITest.TestWorkloads sometimes fails
> -------------------------------------------------
>
> Key: KUDU-3058
> URL: https://issues.apache.org/jira/browse/KUDU-3058
> Project: Kudu
> Issue Type: Bug
> Components: test
> Affects Versions: 1.12.0
> Reporter: Alexey Serbin
> Priority: Minor
> Attachments: maintenance_mode-itest.0.txt, maintenance_mode-itest.0.txt.xz, maintenance_mode-itest.00.txt.xz
>
> The scenario sometimes fails with an error like below:
> {noformat}
> 0219 20:46:16.414293 (+11us) service_pool.cc:221] Handling call
> 0219 20:46:18.247943 (+1833650us) inbound_call.cc:162] Queueing success response
> Metrics: {}
> I0219 20:46:19.529562 31739 ts_manager.cc:267] Unset tserver state for f0df965344df403a86c95138c5e0f771 from MAINTENANCE_MODE
> I0219 20:46:19.535373 31739 ts_manager.cc:267] Unset tserver state for c63029ba5b4148ab90e9d437e9487c76 from MAINTENANCE_MODE
> I0219 20:46:19.538385 31739 ts_manager.cc:267] Unset tserver state for 4216503399e4476694010b88d3ab8cc5 from MAINTENANCE_MODE
> I0219 20:46:19.542889 31739 ts_manager.cc:267] Unset tserver state for 5080d97cb158428d8f86ab6797dd8149 from MAINTENANCE_MODE
> /data0/jenkins/workspace/kudu-pre-commit-unittest-RELEASE/src/kudu/integration-tests/maintenance_mode-itest.cc:750: Failure
> Value of: s.ok()
>   Actual: true
> Expected: false
> /data0/jenkins/workspace/kudu-pre-commit-unittest-RELEASE/src/kudu/util/test_util.cc:345: Failure
> Failed
> Timed out waiting for assertion to pass.
> {noformat}
> The log is attached.
--
This message was sent by Atlassian Jira (v8.3.4#803005)