[jira] [Resolved] (KUDU-2480) tsan failure of master-stress-test

2019-05-20 Thread Andrew Wong (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wong resolved KUDU-2480.
---
   Resolution: Fixed
Fix Version/s: 1.8.0

Dan merged this as 4479c20.

> tsan failure of master-stress-test
> --
>
> Key: KUDU-2480
> URL: https://issues.apache.org/jira/browse/KUDU-2480
> Project: Kudu
>  Issue Type: Test
>Reporter: Hao Hao
>Assignee: Dan Burkert
>Priority: Major
> Fix For: 1.8.0
>
> Attachments: master-stress-test.txt
>
>
> master-stress-test has recently been very flaky (~24%).  One of the failure 
> logs:
> {noformat}WARNING: ThreadSanitizer: data race (pid=26513)
>  Read of size 8 at 0x7ffb5e5b88b8 by thread T65:
>  #0 kudu::Status::operator=(kudu::Status const&) 
> /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/util/status.h:469:7
>  (libmaster.so+0x10bd00)
>  #1 kudu::Synchronizer::StatusCB(kudu::Status const&) 
> /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/util/async_util.h:44:8
>  (libmaster.so+0x10bc40)
>  #2 kudu::internal::RunnableAdapter const&)>::Run(kudu::Synchronizer*, kudu::Status const&) 
> /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/gutil/bind_internal.h:192:12
>  (libmaster.so+0x10c708)
>  #3 kudu::internal::InvokeHelper kudu::internal::RunnableAdapter const&)>, void ()(kudu::Synchronizer*, kudu::Status 
> const&)>::MakeItSo(kudu::internal::RunnableAdapter (kudu::Synchronizer::*)(kudu::Status const&)>, kudu::Synchronizer*, 
> kudu::Status const&) 
> /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/gutil/bind_internal.h:889:14
>  (libmaster.so+0x10c5e8)
>  #4 kudu::internal::Invoker<1, 
> kudu::internal::BindState (kudu::Synchronizer::*)(kudu::Status const&)>, void ()(kudu::Synchronizer*, 
> kudu::Status const&), void 
> ()(kudu::internal::UnretainedWrapper)>, void 
> ()(kudu::Synchronizer*, kudu::Status 
> const&)>::Run(kudu::internal::BindStateBase*, kudu::Status const&) 
> /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/gutil/bind_internal.h:1118:12
>  (libmaster.so+0x10c51a)
>  #5 kudu::Callback::Run(kudu::Status const&) 
> const 
> /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/gutil/callback.h:436:12
>  (libmaster.so+0x10b831)
>  #6 kudu::master::HmsNotificationLogListenerTask::RunLoop() 
> /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/master/hms_notification_log_listener.cc:136:10
>  (libmaster.so+0x108e0a)
>  #7 boost::_mfi::mf0 kudu::master::HmsNotificationLogListenerTask>::operator()(kudu::master::HmsNotificationLogListenerTask*)
>  const 
> /data/somelongdirectorytoavoidrpathissues/src/kudu/thirdparty/installed/tsan/include/boost/bind/mem_fn_template.hpp:49:29
>  (libmaster.so+0x110ea9)
>  #8 void 
> boost::_bi::list1
>  >::operator() kudu::master::HmsNotificationLogListenerTask>, 
> boost::_bi::list0>(boost::_bi::type, boost::_mfi::mf0 kudu::master::HmsNotificationLogListenerTask>&, boost::_bi::list0&, int) 
> /data/somelongdirectorytoavoidrpathissues/src/kudu/thirdparty/installed/tsan/include/boost/bind/bind.hpp:259:9
>  (libmaster.so+0x110dfa)
>  #9 boost::_bi::bind_t kudu::master::HmsNotificationLogListenerTask>, 
> boost::_bi::list1
>  > >::operator()() 
> /data/somelongdirectorytoavoidrpathissues/src/kudu/thirdparty/installed/tsan/include/boost/bind/bind.hpp:1222:16
>  (libmaster.so+0x110d83)
>  #10 
> boost::detail::function::void_function_obj_invoker0 boost::_mfi::mf0, 
> boost::_bi::list1
>  > >, void>::invoke(boost::detail::function::function_buffer&) 
> /data/somelongdirectorytoavoidrpathissues/src/kudu/thirdparty/installed/tsan/include/boost/function/function_template.hpp:159:11
>  (libmaster.so+0x110b79)
>  #11 boost::function0::operator()() const 
> /data/somelongdirectorytoavoidrpathissues/src/kudu/thirdparty/installed/tsan/include/boost/function/function_template.hpp:770:14
>  (libkrpc.so+0xb64b1)
>  #12 kudu::Thread::SuperviseThread(void*) 
> /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/util/thread.cc:603:3
>  (libkudu_util.so+0x1bd8b4)
> Previous write of size 8 at 0x7ffb5e5b88b8 by thread T24 (mutexes: read 
> M1468):
>  #0 
> boost::intrusive::circular_list_algorithms
>  >::init(boost::intrusive::list_node* const&) 
> /data/somelongdirectorytoavoidrpathissues/src/kudu/thirdparty/installed/tsan/include/boost/intrusive/circular_list_algorithms.hpp:72:22
>  (libkrpc.so+0x99c92)
>  #1 
> boost::intrusive::generic_hook
>  >, boost::intrusive::dft_tag, (boost::intrusive::link_mode_type)1, 
> (boost::intrusive::base_hook_type)1>::generic_hook() 
> /data/somelongdirectorytoavoidrpathissues/src/kudu/thirdparty/installed/tsan/include/boost/intrusive/detail/generic_hook.hpp:174:10
>  (libkrpc.so+0xc4669)
>  #2 boost::intrusive::list_base_hook::list_base_hook() 
> /data/somelongdirectorytoavoidrpathissues/src/kudu/t
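
For context, my reading of the trace above (an assumption, not confirmed 
against 4479c20; the template arguments were also stripped by the mail 
archive): the listener thread's status callback can fire after the thread 
that owns the Synchronizer has given up waiting, so the callback writes into 
a stack frame that has already been reclaimed and reused. A simplified, 
deliberately racy illustration of that pattern, with hypothetical types:

{code}
// Illustration only (hypothetical types, not the actual Kudu classes).
// The bug: the owner abandons the wait, so the background thread's
// StatusCB() may write into a dead stack frame.
#include <functional>
#include <thread>

struct Status { int code = 0; };

struct Synchronizer {
  Status status;
  void StatusCB(const Status& s) {
    status = s;  // TSAN flags this write if the owner's frame is gone
  }
};

int main() {
  std::function<void(const Status&)> cb;
  std::thread listener;
  {
    Synchronizer sync;                     // lives in this stack frame
    cb = [&sync](const Status& s) { sync.StatusCB(s); };
    listener = std::thread([&cb] { cb(Status{}); });
    // BUG: leaving the scope without waiting for the callback; `sync`
    // is destroyed while the listener thread may still be running it.
  }
  listener.join();  // by now the callback may have touched dead stack memory
  return 0;
}
{code}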

[jira] [Resolved] (KUDU-2472) master-stress-test flaky with failure to create table due to not enough tservers

2019-05-20 Thread Andrew Wong (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wong resolved KUDU-2472.
---
   Resolution: Fixed
Fix Version/s: 1.8.0

Dan merged this as c1c15ad, noting a 1% flaky rate due to another issue.

> master-stress-test flaky with failure to create table due to not enough 
> tservers
> 
>
> Key: KUDU-2472
> URL: https://issues.apache.org/jira/browse/KUDU-2472
> Project: Kudu
>  Issue Type: Bug
>Reporter: Dan Burkert
>Assignee: Dan Burkert
>Priority: Major
> Fix For: 1.8.0
>
>
> Currently {{master-stress-test}} is 5-7% flaky, failing during a create table 
> operation:
> {code:java}
> F0611 20:58:01.335697 23508 master-stress-test.cc:217] Check failed: _s.ok() 
> Bad status: Invalid argument: Error creating table 
> default.table_6473953088b54f90af982172a0471cf6 on the master: not enough live 
> tablet servers to create a table with the requested replication factor 3; 2 
> tablet servers are alive{code}
> Due to the frequent master failovers introduced by the test, CREATE TABLE 
> operations fail because the current leader master, which was likely just 
> started and quickly elected, does not yet know about enough live tablet 
> servers.
> In this case the master returns an InvalidArgument status to the client, 
> which does not retry it.  This indicates a real issue that could occur in a 
> production cluster if the leader master were restarted and quickly regained 
> leadership.  I'm not sure yet what the right fix is; I can think of at least 
> a few options:
>  * Change the return status to ServiceUnavailable. The client will retry 
> up to the timeout.  The downside is that in legitimate scenarios where there 
> aren't enough tablet servers, the operation will take the full timeout to 
> fail, and will probably carry a less useful error status type.  Perhaps we 
> could have a heuristic which says that if the leader hasn't been active for 
> at least {{n * heartbeat_interval}} (where n is a small integer), then 
> ServiceUnavailable is used (see the sketch after this list).
>  * Change master-stress-test to use tables with replication factor 1. This 
> makes it much less likely for the race to occur, although it's still 
> possible.  This also doesn't fix the underlying issue.
>  * Introduce a special case in the table-creating thread of 
> master-stress-test to retry the specific {{InvalidArgument}} status.  This 
> also doesn't fix the underlying issue.
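
A minimal sketch of the heuristic from the first option, with hypothetical, 
simplified types (not the actual Kudu master code):

{code}
// Sketch of the proposed status heuristic (hypothetical, simplified types;
// not the actual Kudu master code). If the leader has been active for less
// than n heartbeat intervals, its view of live tablet servers may be stale,
// so return a retryable status instead of InvalidArgument.
#include <chrono>
#include <string>

enum class Code { kOk, kInvalidArgument, kServiceUnavailable };
struct Status { Code code; std::string msg; };

Status CheckEnoughLiveTabletServers(
    int num_live, int replication_factor,
    std::chrono::milliseconds leader_active_for,
    std::chrono::milliseconds heartbeat_interval) {
  if (num_live >= replication_factor) return {Code::kOk, ""};
  const int n = 3;  // the "small integer" grace period from the proposal
  if (leader_active_for < n * heartbeat_interval) {
    // The leader was elected recently; heartbeats may simply not have
    // arrived yet. ServiceUnavailable is retried by the client.
    return {Code::kServiceUnavailable, "not enough live tablet servers yet"};
  }
  // The leader has a settled view; the cluster genuinely lacks tservers.
  return {Code::kInvalidArgument, "not enough live tablet servers"};
}
{code}

With a 500 ms heartbeat and n = 3, for example, a leader active for under 
1.5 s would return the retryable status.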



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (KUDU-2395) Thread spike with all threads blocked in libnss

2019-05-20 Thread Alexey Serbin (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844162#comment-16844162
 ] 

Alexey Serbin edited comment on KUDU-2395 at 5/20/19 5:56 PM:
--

[~tlipcon] I think adding a cache for resolved DNS entries should fix this 
issue; at least with the cache I don't expect the number of threads performing 
DNS resolution to jump that high.  But it would be nice to add some sort of 
test for that, or at least exercise that scenario manually once, after the DNS 
cache is in place.

I'll prioritize revving https://gerrit.cloudera.org/#/c/13266/ this week.  
Thank you for the reminder.
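
A minimal sketch of such a TTL cache for resolved entries, with hypothetical 
names (the actual change is in the gerrit review linked above):

{code}
// Minimal sketch of a TTL cache for resolved DNS entries (hypothetical
// shape; not the change under review). The point: concurrent lookups for a
// cached host hit the map under a short-lived lock instead of each thread
// blocking inside getaddrinfo()/libnss.
#include <chrono>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

class DnsCache {
 public:
  explicit DnsCache(std::chrono::seconds ttl) : ttl_(ttl) {}

  // Returns true and fills 'addrs' on a fresh cache hit.
  bool Lookup(const std::string& host, std::vector<std::string>* addrs) {
    std::lock_guard<std::mutex> l(lock_);
    auto it = cache_.find(host);
    if (it == cache_.end()) return false;
    if (std::chrono::steady_clock::now() > it->second.expiry) {
      cache_.erase(it);  // stale entry: fall back to a real resolution
      return false;
    }
    *addrs = it->second.addrs;
    return true;
  }

  // Called by whichever thread performed the real resolution.
  void Insert(const std::string& host, std::vector<std::string> addrs) {
    std::lock_guard<std::mutex> l(lock_);
    cache_[host] = {std::move(addrs),
                    std::chrono::steady_clock::now() + ttl_};
  }

 private:
  struct Entry {
    std::vector<std::string> addrs;
    std::chrono::steady_clock::time_point expiry;
  };
  std::mutex lock_;
  std::unordered_map<std::string, Entry> cache_;
  const std::chrono::seconds ttl_;
};
{code}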


was (Author: aserbin):
[~tlipcon] I think adding a cache for resolved DNS entries should fix this 
issue; at least with cached DNS names I don't expect the number of threads 
performing DNS resolution to go that high.  But it would be nice to add some 
sort of test for that (or at least test that scenario manually once).

I'll prioritize revving https://gerrit.cloudera.org/#/c/13266/ this week.  
Thank you for the reminder.

> Thread spike with all threads blocked in libnss
> ---
>
> Key: KUDU-2395
> URL: https://issues.apache.org/jira/browse/KUDU-2395
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus, tserver, util
>Reporter: Todd Lipcon
>Priority: Minor
>
> I saw the thread count on a server under a load test spike from 280 threads 
> (fairly constant) to 3400 threads (briefly). I checked the diagnostics log 
> and found that there are several thousand threads in a stack like:
> {code}
> 0x7facce018606 _nss_files_gethostbyname2_r
>   0x345a703645 
>   0x345a6d0b3b 
>   0x345a6d2d80 
>  0x1c9366c kudu::(anonymous namespace)::GetAddrInfo()
>  0x1c95fbe kudu::HostPort::ResolveAddresses()
>   0xac4b78 kudu::consensus::(anonymous 
> namespace)::CreateConsensusServiceProxyForHost()
>   0xac5058 kudu::consensus::RpcPeerProxyFactory::NewProxy()
>   0xb0b212 kudu::consensus::LeaderElection::LeaderElection()
>   0xafab80 kudu::consensus::RaftConsensus::StartElection()
>   0xafd20c kudu::consensus::RaftConsensus::ReportFailureDetectedTask()
>  0x1ccf4ed kudu::FunctionRunnable::Run()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2395) Thread spike with all threads blocked in libnss

2019-05-20 Thread Alexey Serbin (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844162#comment-16844162
 ] 

Alexey Serbin commented on KUDU-2395:
-

[~tlipcon] I think adding a cache for resolved DNS entries should fix this 
issue; at least with cached DNS names I don't expect the number of threads 
performing DNS resolution to go that high.  But it would be nice to add some 
sort of test for that (or at least test that scenario manually once).

I'll prioritize revving https://gerrit.cloudera.org/#/c/13266/ this week.  
Thank you for the reminder.

> Thread spike with all threads blocked in libnss
> ---
>
> Key: KUDU-2395
> URL: https://issues.apache.org/jira/browse/KUDU-2395
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus, tserver, util
>Reporter: Todd Lipcon
>Priority: Minor
>
> I saw the thread count on a server under a load test spike from 280 threads 
> (fairly constant) to 3400 threads (briefly). I checked the diagnostics log 
> and found that there are several thousand threads in a stack like:
> {code}
> 0x7facce018606 _nss_files_gethostbyname2_r
>   0x345a703645 
>   0x345a6d0b3b 
>   0x345a6d2d80 
>  0x1c9366c kudu::(anonymous namespace)::GetAddrInfo()
>  0x1c95fbe kudu::HostPort::ResolveAddresses()
>   0xac4b78 kudu::consensus::(anonymous 
> namespace)::CreateConsensusServiceProxyForHost()
>   0xac5058 kudu::consensus::RpcPeerProxyFactory::NewProxy()
>   0xb0b212 kudu::consensus::LeaderElection::LeaderElection()
>   0xafab80 kudu::consensus::RaftConsensus::StartElection()
>   0xafd20c kudu::consensus::RaftConsensus::ReportFailureDetectedTask()
>  0x1ccf4ed kudu::FunctionRunnable::Run()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2395) Thread spike with all threads blocked in libnss

2019-05-20 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844126#comment-16844126
 ] 

Todd Lipcon commented on KUDU-2395:
---

[~aserbin] do we expect this will be fully fixed by KUDU-2791?

> Thread spike with all threads blocked in libnss
> ---
>
> Key: KUDU-2395
> URL: https://issues.apache.org/jira/browse/KUDU-2395
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus, tserver, util
>Reporter: Todd Lipcon
>Priority: Minor
>
> I saw the thread count on a server under a load test spike from 280 threads 
> (fairly constant) to 3400 threads (briefly). I checked the diagnostics log 
> and found that there are several thousand threads in a stack like:
> {code}
> 0x7facce018606 _nss_files_gethostbyname2_r
>   0x345a703645 
>   0x345a6d0b3b 
>   0x345a6d2d80 
>  0x1c9366c kudu::(anonymous namespace)::GetAddrInfo()
>  0x1c95fbe kudu::HostPort::ResolveAddresses()
>   0xac4b78 kudu::consensus::(anonymous 
> namespace)::CreateConsensusServiceProxyForHost()
>   0xac5058 kudu::consensus::RpcPeerProxyFactory::NewProxy()
>   0xb0b212 kudu::consensus::LeaderElection::LeaderElection()
>   0xafab80 kudu::consensus::RaftConsensus::StartElection()
>   0xafd20c kudu::consensus::RaftConsensus::ReportFailureDetectedTask()
>  0x1ccf4ed kudu::FunctionRunnable::Run()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KUDU-2366) LockManager consumes significant memory

2019-05-20 Thread Todd Lipcon (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned KUDU-2366:
-

Assignee: (was: Todd Lipcon)

> LockManager consumes significant memory
> ---
>
> Key: KUDU-2366
> URL: https://issues.apache.org/jira/browse/KUDU-2366
> Project: Kudu
>  Issue Type: Improvement
>  Components: tablet
>Affects Versions: 1.7.0
>Reporter: Todd Lipcon
>Priority: Major
>
> Looking at a heap dump of a server that's been running for a while with an 
> ingest workload across multiple tables, I see the LockManager is using about 
> 200MB of RAM. The workload in this case has batches of about 30,000 rows 
> each, so while each batch is in flight the LockManager hashtable has that 
> many locks in it. That causes it to expand to the next higher power of two 
> (64k slots). Each slot takes 16 bytes, so the lock table is reaching about 
> 1MB. We never resize _down_, so even once the tablet becomes cold, it still 
> uses 1MB of unrecoverable RAM for the rest of the tserver lifetime.
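
A quick back-of-the-envelope check of the figures above (the jump from 30,000 
entries to 64k rather than 32k slots presumably reflects the table's growth 
policy, e.g. a load-factor threshold; that part is an assumption):

{code}
// Back-of-the-envelope check of the figures above (not Kudu code).
#include <cstdint>
#include <iostream>

int main() {
  const uint64_t slots = 64 * 1024;  // table grown to 64k slots, per above
  const uint64_t slot_bytes = 16;    // bytes per hashtable slot
  std::cout << slots * slot_bytes << " bytes\n";  // 1048576 ~= 1MB per table
  return 0;
}
{code}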



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)