xy720 opened a new pull request, #48101:
URL: https://github.com/apache/doris/pull/48101
### What problem does this PR solve?
The BE core dumps with the following stack trace:
```
F20250219 16:11:04.432318 1252192 global.cpp:309] /data/home/lambxu/work/git/doris-tencent/doris-2.1/thirdparty/installed/include/google/protobuf/map.h:1293 CHECK failed: it != end(): key not found: 184236
*** Check failure stack trace: ***
@ 0x5626cb1376e6 google::LogMessage::SendToLog()
@ 0x5626cb134130 google::LogMessage::Flush()
@ 0x5626cb137f29 google::LogMessageFatal::~LogMessageFatal()
@ 0x5626ccc270ea (unknown)
@ 0x5626cb86bfa5 google::protobuf::internal::LogMessage::Finish()
@ 0x5626c15f6e1e google::protobuf::Map<>::at<>()
@ 0x5626c15f3604 doris::TabletsChannel::_commit_txn()
@ 0x5626c15f2f8b doris::TabletsChannel::close()
@ 0x5626c14fb43a doris::LoadChannel::_handle_eos()
@ 0x5626c14fb0f2 doris::LoadChannel::add_batch()
@ 0x5626c14f5800 doris::LoadChannelMgr::add_batch()
@ 0x5626c1667811 std::_Function_handler<>::_M_invoke()
@ 0x5626c168199b doris::WorkThreadPool<>::work_thread()
@ 0x5626cdf081a0 execute_native_thread_routine
@ 0x7efdc1a25215 start_thread
@ 0x7efdc1aa7bdc __clone3
@ (nil) (unknown)
*** Query id: 9c94fa66404748-ab7987b4563a6318 ***
*** is nereids: 0 ***
*** tablet id: 0 ***
*** Aborted at 1739952664 (unix time) try "date -d @1739952664" if you are using GNU date ***
*** Current BE git commitID: f112af0fd2 ***
*** SIGABRT unknown detail explain (@0x3e8001317ab) received by PID 1251243 (TID 1252192 OR 0x7efbc38bd6c0) from PID 1251243; stack trace: ***
0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /data/home/lambxu/work/git/doris-tencent/doris-2.1/be/src/common/signal_handler.h:421
1# 0x00007EFDC19D7AD0 in /lib64/libc.so.6
2# __pthread_kill_implementation in /lib64/libc.so.6
3# raise in /lib64/libc.so.6
4# __GI_abort in /lib64/libc.so.6
5# 0x00005626CB141FBD in /usr/local/service/doris/lib/be/doris_be
6# 0x00005626CB1345FA in /usr/local/service/doris/lib/be/doris_be
7# google::LogMessage::SendToLog() in /usr/local/service/doris/lib/be/doris_be
8# google::LogMessage::Flush() in /usr/local/service/doris/lib/be/doris_be
9# google::LogMessageFatal::~LogMessageFatal() in /usr/local/service/doris/lib/be/doris_be
10# 0x00005626CCC270EA in /usr/local/service/doris/lib/be/doris_be
11# google::protobuf::internal::LogMessage::Finish() in /usr/local/service/doris/lib/be/doris_be
12# doris::PSlaveTabletNodes const& google::protobuf::Map<long, doris::PSlaveTabletNodes>::at<long>(long const&) const in /usr/local/service/doris/lib/be/doris_be
13# doris::TabletsChannel::_commit_txn(doris::DeltaWriter*, doris::PTabletWriterAddBlockRequest const&, doris::PTabletWriterAddBlockResult*) in /usr/local/service/doris/lib/be/doris_be
14# doris::TabletsChannel::close(doris::LoadChannel*, doris::PTabletWriterAddBlockRequest const&, doris::PTabletWriterAddBlockResult*, bool*) at /data/home/lambxu/work/git/doris-tencent/doris-2.1/be/src/runtime/tablets_channel.cpp:367
15# doris::LoadChannel::_handle_eos(doris::BaseTabletsChannel*, doris::PTabletWriterAddBlockRequest const&, doris::PTabletWriterAddBlockResult*) at /data/home/lambxu/work/git/doris-tencent/doris-2.1/be/src/runtime/load_channel.cpp:191
16# doris::LoadChannel::add_batch(doris::PTabletWriterAddBlockRequest const&, doris::PTabletWriterAddBlockResult*) at /data/home/lambxu/work/git/doris-tencent/doris-2.1/be/src/runtime/load_channel.cpp:172
17# doris::LoadChannelMgr::add_batch(doris::PTabletWriterAddBlockRequest const&, doris::PTabletWriterAddBlockResult*) at /data/home/lambxu/work/git/doris-tencent/doris-2.1/be/src/runtime/load_channel_mgr.cpp:156
18# std::_Function_handler<void (), doris::PInternalServiceImpl::tablet_writer_add_block(google::protobuf::RpcController*, doris::PTabletWriterAddBlockRequest const*, doris::PTabletWriterAddBlockResult*, google::protobuf::Closure*)::$_0>::_M_invoke(std::_Any_data const&) at /data/home/lambxu/installs/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:291
19# doris::WorkThreadPool<false>::work_thread(int) at /data/home/lambxu/work/git/doris-tencent/doris-2.1/be/src/util/work_thread_pool.hpp:159
20# execute_native_thread_routine at ../../../../../libstdc++-v3/src/c++11/thread.cc:84
21# start_thread in /lib64/libc.so.6
22# __GI___clone3 in /lib64/libc.so.6
```
This happens because the PTabletWriterAddBlockRequest sent from the coordinator BE may not carry the latest slave nodes.
For example, see the following log:
```
I20250219 16:15:58.919366 1260197 vtablet_writer.cpp:988] VNodeChannel[151455-10002], load_id=fa58b8bcfada49e4-827999f3d6e97fe5, txn_id=40574, node=10.0.19.244:8060 mark closed, left pending batch size: 1
I20250219 16:15:58.921607 1260201 vrow_distribution.cpp:98] [DEBUG] VRowDistribution::automatic_create_partition, request: TCreatePartitionRequest(txn_id=40574, db_id=11403, table_id=151454, partitionValues=[[TNullableStringLiteral(value=xxxx, is_null=0)]])
I20250219 16:15:58.925675 1256625 tablets_channel.cpp:180] [DEBUG] BaseTabletsChannel::incremental_open
TabletsChannelKey id {
hi: -407372644475057692
lo: -9045029104035201051
}
W20250219 16:15:58.946277 1256557 tablets_channel.cpp:400] [DEBUG] TabletsChannel::_commit_txn
TabletsChannelKey id {
hi: -407372644475057692
lo: -9045029104035201051
}
F20250219 17:51:13.139384 14966 global.cpp:309] /data/home/lambxu/work/git/doris-master/doris/thirdparty/installed/include/google/protobuf/map.h:1293 CHECK failed: it != end(): key not found: 184236
```
I added some debug logs. They show that after the node channel is marked closed, the incremental open request arrives before the add block request.
The new DeltaWriter created by the incremental open then cannot find its slave nodes in the request, so the `Map::at()` lookup in `_commit_txn` fails its CHECK and the BE crashes.
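
To illustrate the failure mode, here is a minimal sketch of a lookup that tolerates a missing key instead of aborting. It is not necessarily the change made in this PR. `SlaveNodes`, `SlaveNodeMap`, `find_slave_nodes`, and `slave_count_or_zero` are hypothetical names, and `std::map` stands in for `google::protobuf::Map<int64_t, PSlaveTabletNodes>` so the snippet compiles without protobuf; the `find()` / `end()` pattern is the same on both containers.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for doris::PSlaveTabletNodes.
struct SlaveNodes {
    std::vector<std::string> hosts;
};

// Hypothetical stand-in for the map carried by the add block request.
// Assumption: the key is the id that _commit_txn looks up (184236 in the log).
using SlaveNodeMap = std::map<int64_t, SlaveNodes>;

// Returns a pointer to the entry for `id`, or nullptr when the request
// carries no entry for it. Unlike at(), this never aborts or throws.
const SlaveNodes* find_slave_nodes(const SlaveNodeMap& nodes, int64_t id) {
    auto it = nodes.find(id);
    return it == nodes.end() ? nullptr : &it->second;
}

// Returns how many slave hosts are known for `id`; 0 when the entry is
// missing, so the caller can skip slave handling for that writer.
std::size_t slave_count_or_zero(const SlaveNodeMap& nodes, int64_t id) {
    const SlaveNodes* entry = find_slave_nodes(nodes, id);
    return entry == nullptr ? 0 : entry->hosts.size();
}
```

With this shape, a DeltaWriter created by a late incremental open would see "no slave nodes" rather than hitting a fatal CHECK. Whether skipping is the right behavior for such a writer is a design decision for the load path.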
Problem Summary:
### Release note
None
### Check List (For Author)
- Test <!-- At least one of them must be included. -->
    - [x] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
        - [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason? -->
- Behavior changed:
    - [x] No.
    - [ ] Yes. <!-- Explain the behavior change -->
- Does this need documentation?
    - [x] No.
    - [ ] Yes. <!-- Add document PR link here. eg: https://github.com/apache/doris-website/pull/1214 -->
### Check List (For Reviewer who merge this PR)
- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into -->
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.