[ https://issues.apache.org/jira/browse/KUDU-2819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mitch Barnett updated KUDU-2819:
--------------------------------
    Description:
While utilizing the Kudu rebalancer utility, a SegFault is consistently occurring during run-time.

The following is seen in the tablet server logs:
{noformat}
*** Aborted at 1556920300 (unix time) try "date -d @1556920300" if you are using GNU date ***
PC: @ 0x2972aec tc_new
*** SIGSEGV (@0x0) received by PID 62640 (TID 0x7f5f7191b980) from PID 0; stack trace: ***
    @ 0x369b00f7e0 (unknown)
    @ 0x2972aec tc_new
    @ 0xc6a077 kudu::client::KuduClient::Data::GetTableSchema()
    @ 0xc56e0d kudu::client::KuduClient::OpenTable()
    @ 0xc38228 kudu::tools::RemoteKsckCluster::RetrieveTablesList()
    @ 0xc2953a kudu::tools::KsckCluster::FetchTableAndTabletInfo()
    @ 0xc217c4 kudu::tools::Ksck::FetchTableAndTabletInfo()
    @ 0xdad2c1 kudu::tools::DoKsckForTablet()
    @ 0xdaf244 kudu::tools::CheckCompleteMove()
    @ 0xd84c18 kudu::tools::Rebalancer::AlgoBasedRunner::UpdateMovesInProgressStatus()
    @ 0xd816f4 kudu::tools::Rebalancer::RunWith()
    @ 0xd8dac6 kudu::tools::Rebalancer::Run()
    @ 0xb34011 (unknown)
    @ 0xb353a4 std::_Function_handler<>::_M_invoke()
    @ 0x10b7eda kudu::tools::Action::Run()
    @ 0xbb4f04 kudu::tools::DispatchCommand()
    @ 0xbb56d3 kudu::tools::RunTool()
    @ 0xad6778 main
    @ 0x369ac1ed1d __libc_start_main
    @ 0xb2ed7d (unknown)
Segmentation fault (core dumped)
{noformat}

Generating the backtrace of the core dump gives us the following, occurring within gperftools:
{noformat}
#0 SLL_Next (t=0x59c18bbfeed6371)
    at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/thirdparty/src/gperftools-2.6.90/src/linked_list.h:45
#1 SLL_TryPop (rv=<synthetic pointer>, list=0x58d4d60)
    at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/thirdparty/src/gperftools-2.6.90/src/linked_list.h:69
#2 TryPop (rv=<synthetic pointer>, this=0x58d4d60)
    at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/thirdparty/src/gperftools-2.6.90/src/thread_cache.h:220
#3 Allocate (oom_handler=0x29711c0 <tcmalloc::cpp_throw_oom(unsigned long)>, cl=9, size=128, this=<optimized out>)
    at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/thirdparty/src/gperftools-2.6.90/src/thread_cache.h:379
#4 malloc_fast_path<tcmalloc::cpp_throw_oom> (size=<optimized out>)
    at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/thirdparty/src/gperftools-2.6.90/src/tcmalloc.cc:1848
#5 tc_new (size=<optimized out>)
    at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/thirdparty/src/gperftools-2.6.90/src/tcmalloc.cc:1969
#6 0x0000000000c6a077 in allocate (__n=1, this=<synthetic pointer>)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/ext/new_allocator.h:104
#7 allocate (__a=<synthetic pointer>, __n=1)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/alloc_traits.h:357
#8 __shared_count<kudu::Synchronizer::Data, std::allocator<kudu::Synchronizer::Data> > (__a=..., this=0x7fff13bcbde8)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr_base.h:616
#9 __shared_ptr<std::allocator<kudu::Synchronizer::Data> > (__a=..., __tag=..., this=0x7fff13bcbde0)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr_base.h:1090
#10 shared_ptr<std::allocator<kudu::Synchronizer::Data> > (__a=..., __tag=..., this=0x7fff13bcbde0)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr.h:316
#11 allocate_shared<kudu::Synchronizer::Data, std::allocator<kudu::Synchronizer::Data> > (__a=...)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr.h:588
#12 make_shared<kudu::Synchronizer::Data> ()
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr.h:604
#13 Synchronizer (this=0x7fff13bcbde0)
    at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/util/async_util.h:47
#14 kudu::client::KuduClient::Data::GetTableSchema (this=<optimized out>, client=client@entry=0x11fe5440, table_name="impala::database.some_table", deadline=..., schema=schema@entry=0x7fff13bcc070, partition_schema=0x7fff13bcc0c0, table_id=0x7fff13bcc080, num_replicas=0x7fff13bcc068)
    at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/client/client-internal.cc:441
#15 0x0000000000c56e0d in kudu::client::KuduClient::OpenTable (this=0x11fe5440, table_name="impala::database.some_table", table=table@entry=0x7fff13bcc180)
    at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/client/client.cc:513
#16 0x0000000000c38228 in kudu::tools::RemoteKsckCluster::RetrieveTablesList (this=0x607d680)
    at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/tools/ksck_remote.cc:502
#17 0x0000000000c2953a in kudu::tools::KsckCluster::FetchTableAndTabletInfo (this=0x607d680)
    at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/tools/ksck.h:408
#18 0x0000000000c217c4 in kudu::tools::Ksck::FetchTableAndTabletInfo (this=this@entry=0x7fff13bcc510)
    at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/tools/ksck.cc:302
---Type <return> to continue, or q <return> to quit---
#19 0x0000000000dad2c1 in kudu::tools::DoKsckForTablet (master_addresses=std::vector of length 3, capacity 3 = {...}, tablet_id="00229fcb55dc4a348e8caae7f7a3fc41")
    at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/tools/tool_replica_util.cc:624
#20 0x0000000000daf244 in kudu::tools::CheckCompleteMove (master_addresses=std::vector of length 3, capacity 3 = {...}, client=std::tr1::shared_ptr (count 1) 0x103345a0, tablet_id="00229fcb55dc4a348e8caae7f7a3fc41", from_ts_uuid="05d76878409e448fba542fade206dd15", to_ts_uuid="26d44b84ff3645d18f03b05a816e21eb", is_complete=0x7fff13bccb4f, completion_status=0x7fff13bccb50)
    at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/tools/tool_replica_util.cc:319
#21 0x0000000000d84c18 in kudu::tools::Rebalancer::AlgoBasedRunner::UpdateMovesInProgressStatus (this=0x7fff13bcd090, has_errors=0x7fff13bccd40, timed_out=0x7fff13bcccdf)
    at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/tools/rebalancer.cc:1173
#22 0x0000000000d816f4 in kudu::tools::Rebalancer::RunWith (this=this@entry=0x7fff13bd2390, runner=runner@entry=0x7fff13bcd090, result_status=result_status@entry=0x7fff13bd20ec)
    at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/tools/rebalancer.cc:912
#23 0x0000000000d8dac6 in kudu::tools::Rebalancer::Run (this=this@entry=0x7fff13bd2390, result_status=result_status@entry=0x7fff13bd20ec, moves_count=moves_count@entry=0x7fff13bd21c8)
    at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/tools/rebalancer.cc:203
#24 0x0000000000b34011 in kudu::tools::(anonymous namespace)::RunRebalance (context=...)
    at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/tools/tool_action_cluster.cc:319
#25 0x0000000000b353a4 in std::_Function_handler<kudu::Status (kudu::tools::RunnerContext const&), kudu::Status (*)(kudu::tools::RunnerContext const&)>::_M_invoke(std::_Any_data const&, kudu::tools::RunnerContext const&) (__functor=..., __args#0=...)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/functional:2025
#26 0x00000000010b7eda in operator() (__args#0=..., this=0x613a650)
    at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/functional:2439
Python Exception <class 'gdb.error'> There is no member or method named _M_element_count.:
#27 kudu::tools::Action::Run (this=this@entry=0x613a630, chain=std::vector of length 2, capacity 2 = {...}, required_args=, variadic_args=std::vector of length 0, capacity 0)
    at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/tools/tool_action.cc:258
#28 0x0000000000bb4f04 in kudu::tools::DispatchCommand (chain=std::vector of length 2, capacity 2 = {...}, action=action@entry=0x613a630, remaining_args=std::deque with 1 elements = {...})
    at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/tools/tool_main.cc:132
#29 0x0000000000bb56d3 in kudu::tools::RunTool (argc=4, argv=0x7fff13bd2960, show_help=show_help@entry=false)
    at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/tools/tool_main.cc:204
#30 0x0000000000ad6778 in main (argc=4, argv=0x7fff13bd2960)
    at /container.redhat6/build/cdh/kudu/1.9.0-cdh6.2.0/rpm/BUILD/kudu-1.9.0-cdh6.2.0/src/kudu/tools/tool_main.cc:265
{noformat}

I don't see an obvious memory mismanagement scenario, like a double-free or use after free.

I suspect there might either be corruption of memory at some point prior to this, or that there's a bug in tcmalloc itself.

> SIGSEGV during kudu cluster rebalance
> -------------------------------------
>
> Key: KUDU-2819
> URL: https://issues.apache.org/jira/browse/KUDU-2819
> Project: Kudu
> Issue Type: Bug
> Affects Versions: 1.9.0
> Reporter: Mitch Barnett
> Priority: Major