About two weeks ago we upgraded our Cloudera cluster from 5.15 to 6.3.3, which took Kudu from 1.7 to 1.10. After the upgrade Kudu ran smoothly for over a week with no issues. The following Sunday we had our normal monthly Linux patch cycle, during which we lost one of the hosts running the Kudu tablet server role. Kudu was therefore started with one fewer tablet server (~10 disks per TS) and began re-replicating the tablet replicas that had lived on the down tablet server.
On Sunday evening some maintenance jobs ran that drop partitions and add them back to Kudu as part of our offload strategy between Kudu and HDFS. At that time Kudu was still actively re-replicating data from the failed node. Shortly after the drops/adds, a single tablet server crashed with no output in the FATAL or ERROR logs. The only information we had was in the stderr log file:

Wrote minidump to /var/log/kudu/minidumps/kudu-tserver/b3d5ca56-f7ac-4f24-c7d53daa-3fa8cd0e.dmp
*** Aborted at 1601853208 (unix time) try "date -d @1601853208" if you are using GNU date ***
PC: @ 0x200755d google::protobuf::TextFormat::Printer::Print()
*** SIGSEGV (@0x3800000001) received by PID 21069 (TID 0x7ff3f1a7a700) from PID 1; stack trace: ***
    @ 0x7ff4f3cb5630 (unknown)
    @ 0x200755d google::protobuf::TextFormat::Printer::Print()
    @ 0x20077ac google::protobuf::TextFormat::Printer::Print()
    @ 0x200784d google::protobuf::TextFormat::Printer::PrintToString()
    @ 0x1e7fa05 kudu::pb_util::SecureShortDebugString()
    @ 0xaa1686 kudu::tablet::AlterSchemaTransactionState::ToString()
    @ 0xaa154f kudu::tablet::AlterSchemaTransaction::ToString()
    @ 0xaa452a kudu::tablet::TransactionDriver::ToString()
    @ 0xaaae41 kudu::tablet::TransactionTracker::WaitForAllToFinish()
    @ 0xaab4df kudu::tablet::TransactionTracker::WaitForAllToFinish()
    @ 0xa9aa7a kudu::tablet::TabletReplica::Stop()
    @ 0x9435f0 kudu::tserver::TSTabletManager::DeleteTablet()
    @ 0x94bacf kudu::tserver::DeleteTabletRunnable::Run()
    @ 0x1ea497f kudu::ThreadPool::DispatchThread()
    @ 0x1e9bea4 kudu::Thread::SuperviseThread()
    @ 0x7ff4f3cadea5 start_thread
    @ 0x7ff4f1f838dd __clone

On Monday evening around 5 pm the same maintenance jobs ran for a different subset of Kudu tables, and 7 tablet servers crashed shortly after the drops/adds executed. The behavior was the same: the only error output was in the stderr file, and it matched the trace above.
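For anyone lining the crash up against other logs, the abort time printed in the stderr output can be decoded as below. This is only a small sketch: the epoch value is copied from the trace, the helper name `epoch_to_utc` is made up for illustration, and the result is in UTC since the cluster's local time zone is not stated here.

```python
from datetime import datetime, timezone

# Abort time copied from the stderr output: "Aborted at 1601853208 (unix time)".
ABORT_EPOCH = 1601853208


def epoch_to_utc(epoch_seconds: int) -> datetime:
    """Convert seconds since the Unix epoch to a timezone-aware UTC datetime.

    Hypothetical helper, not part of Kudu or glog.
    """
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


abort_time = epoch_to_utc(ABORT_EPOCH)
print(abort_time.isoformat())  # 2020-10-04T23:13:28+00:00
```

October 4, 2020 was a Sunday, which is consistent with the first crash following the Sunday evening maintenance jobs.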
We already have a case open with our vendor, but since this almost looks like an unhandled null pointer exception, we were wondering if anyone has seen something similar.

Thanks,
Joshua Picton