About two weeks ago we upgraded our Cloudera cluster from 5.15 to 6.3.3, which 
took Kudu from 1.7 to 1.10. After the upgrade Kudu ran smoothly for over a 
week with no issues. The following weekend (Sunday) we had our normal monthly 
Linux patch cycle, and during it we lost one of the hosts where the Kudu role 
was running. Kudu was therefore started with one fewer host/tablet server 
(~10 disks per TS) and began re-replicating the tablet copies that had lived 
on the down tablet server.

On Sunday evening some maintenance jobs ran that drop partitions from Kudu and 
add them back as part of our offload strategy between Kudu and HDFS. At that 
time Kudu was still actively re-replicating data from the failed node. Shortly 
after the drops/adds, a single tablet server crashed with nothing in the FATAL 
or ERROR logs. The only information we had was in the stderr log file:

Wrote minidump to /var/log/kudu/minidumps/kudu-tserver/b3d5ca56-f7ac-4f24-c7d53daa-3fa8cd0e.dmp
*** Aborted at 1601853208 (unix time) try "date -d @1601853208" if you are using GNU date ***
PC: @          0x200755d google::protobuf::TextFormat::Printer::Print()
*** SIGSEGV (@0x3800000001) received by PID 21069 (TID 0x7ff3f1a7a700) from PID 1; stack trace: ***
    @     0x7ff4f3cb5630 (unknown)
    @          0x200755d google::protobuf::TextFormat::Printer::Print()
    @          0x20077ac google::protobuf::TextFormat::Printer::Print()
    @          0x200784d google::protobuf::TextFormat::Printer::PrintToString()
    @          0x1e7fa05 kudu::pb_util::SecureShortDebugString()
    @           0xaa1686 kudu::tablet::AlterSchemaTransactionState::ToString()
    @           0xaa154f kudu::tablet::AlterSchemaTransaction::ToString()
    @           0xaa452a kudu::tablet::TransactionDriver::ToString()
    @           0xaaae41 kudu::tablet::TransactionTracker::WaitForAllToFinish()
    @           0xaab4df kudu::tablet::TransactionTracker::WaitForAllToFinish()
    @           0xa9aa7a kudu::tablet::TabletReplica::Stop()
    @           0x9435f0 kudu::tserver::TSTabletManager::DeleteTablet()
    @           0x94bacf kudu::tserver::DeleteTabletRunnable::Run()
    @          0x1ea497f kudu::ThreadPool::DispatchThread()
    @          0x1e9bea4 kudu::Thread::SuperviseThread()
    @     0x7ff4f3cadea5 start_thread
    @     0x7ff4f1f838dd __clone
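
For anyone without GNU date handy, the epoch value on the "Aborted at" line 
can be decoded with a few lines of Python. This is only a convenience sketch; 
the variable names are illustrative, and the result is in UTC, so it still 
needs shifting to the cluster's local time zone before comparing it with the 
job schedule.

```python
from datetime import datetime, timezone

# Epoch seconds copied from the "*** Aborted at ... (unix time)" line in stderr.
crash_epoch = 1601853208

# Interpret the value as UTC; glog prints raw seconds since the Unix epoch.
crash_utc = datetime.fromtimestamp(crash_epoch, tz=timezone.utc)

print(crash_utc.isoformat())     # 2020-10-04T23:13:28+00:00
print(crash_utc.strftime("%A"))  # weekday name, useful for lining up with job runs
```

That timestamp falls on the Sunday evening described above (late evening UTC).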

On Monday evening around 5 pm the same maintenance jobs ran for a different 
subset of Kudu tables, and 7 tablet servers crashed shortly after the 
drops/adds executed. The behavior was the same: the only error output was in 
the stderr file, and it matched the trace above.

We already have a case open with our vendor, but since this almost looks like 
an unhandled null pointer exception, we were wondering if anyone has seen 
something similar.

Thanks,
Joshua Picton

