Kudu 1.10.0-cdh6.3.3 - SIGSEGV crash on several Production tablet servers at around the same time

Joshua Picton Wed, 07 Oct 2020 07:55:10 -0700

About two weeks ago we upgraded our Cloudera cluster from 5.15 to 6.3.3, which 
upgraded Kudu from 1.7 to 1.10. After the upgrade, Kudu ran smoothly for over a 
week with no issues. The following weekend (Sunday) we had our normal monthly 
Linux patch cycle, during which we lost one of the hosts where the Kudu role 
was running. Kudu was therefore started with one fewer tablet server (~10 disks 
per TS) and began re-replicating the tablet copies that had been on the down 
tablet server.


On Sunday evening, maintenance jobs ran that drop and add partitions in Kudu 
as part of our offload strategy between Kudu and HDFS. At that time Kudu was 
still actively re-replicating data from the failed node. Shortly after the 
drops/adds, a single tablet server crashed with nothing written to the FATAL 
or ERROR logs. The only information we had was in the stderr log file:

Wrote minidump to /var/log/kudu/minidumps/kudu-tserver/b3d5ca56-f7ac-4f24-c7d53daa-3fa8cd0e.dmp
*** Aborted at 1601853208 (unix time) try "date -d @1601853208" if you are using GNU date ***
PC: @          0x200755d google::protobuf::TextFormat::Printer::Print()
*** SIGSEGV (@0x3800000001) received by PID 21069 (TID 0x7ff3f1a7a700) from PID 1; stack trace: ***
    @     0x7ff4f3cb5630 (unknown)
    @          0x200755d google::protobuf::TextFormat::Printer::Print()
    @          0x20077ac google::protobuf::TextFormat::Printer::Print()
    @          0x200784d google::protobuf::TextFormat::Printer::PrintToString()
    @          0x1e7fa05 kudu::pb_util::SecureShortDebugString()
    @           0xaa1686 kudu::tablet::AlterSchemaTransactionState::ToString()
    @           0xaa154f kudu::tablet::AlterSchemaTransaction::ToString()
    @           0xaa452a kudu::tablet::TransactionDriver::ToString()
    @           0xaaae41 kudu::tablet::TransactionTracker::WaitForAllToFinish()
    @           0xaab4df kudu::tablet::TransactionTracker::WaitForAllToFinish()
    @           0xa9aa7a kudu::tablet::TabletReplica::Stop()
    @           0x9435f0 kudu::tserver::TSTabletManager::DeleteTablet()
    @           0x94bacf kudu::tserver::DeleteTabletRunnable::Run()
    @          0x1ea497f kudu::ThreadPool::DispatchThread()
    @          0x1e9bea4 kudu::Thread::SuperviseThread()
    @     0x7ff4f3cadea5 start_thread
    @     0x7ff4f1f838dd __clone
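
As a quick cross-check of the timeline, the epoch value on the "Aborted at" 
line can be converted to a calendar time. Below is a minimal Python sketch of 
that conversion; the helper name epoch_to_utc is ours for illustration and is 
not part of Kudu or its logging.

```python
from datetime import datetime, timezone

# Epoch value copied from the "*** Aborted at ..." line in the stderr log.
ABORT_EPOCH = 1601853208


def epoch_to_utc(epoch_seconds: int) -> datetime:
    """Return the timezone-aware UTC datetime for a Unix timestamp."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


crash_time = epoch_to_utc(ABORT_EPOCH)
print(crash_time.isoformat())  # 2020-10-04T23:13:28+00:00
```

That date is a Sunday, in line with the account above. The local wall-clock 
time depends on the servers' time zone, which is not stated here.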

On Monday evening around 5 pm, the same maintenance jobs ran for a different 
subset of Kudu tables, and 7 tablet servers crashed shortly after the 
drops/adds executed. The behavior was the same: the only error output was in 
the stderr file, and it matched the trace above.

We already have a case open with our vendor, but since this looks like a 
dereference of a null or otherwise invalid pointer, we were wondering whether 
anyone has seen something similar.

Thanks,
Joshua Picton


