Hello Alexey Serbin, Kudu Jenkins, Adar Dembo, Todd Lipcon,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12170

to look at the new patch set (#6).

Change subject: client: avoid accessing state after scheduling retry
......................................................................

client: avoid accessing state after scheduling retry

I saw a failure of MasterMigrationTest.TestEndToEndMigration in which
the stack watchdog logged the following stacktrace:

User stack:
    @     0x7fa3e575c330  (unknown) at ??:0
    @           0x52dc09  __sanitizer::internal_read() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_syscall_linux_x86_64.inc:46 (discriminator 7)
    @           0x52f79f  __sanitizer::ReadFromFile() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_posix.cc:176
    @           0x53acb9  __sanitizer::SymbolizerProcess::ReadFromSymbolizer() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_libcdep.cc:468
    @           0x53ba76  __sanitizer::SymbolizerProcess::SendCommand() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_libcdep.cc:445
    @           0x53c3c5  __sanitizer::Symbolizer::SymbolizePC() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_libcdep.cc:356
    @           0x539673  __sanitizer::StackTrace::Print() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_stacktrace_libcdep.cc:36
    @           0x541c44  MaybePrintStackTrace() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/ubsan/ubsan_diag.cc:48
    @           0x5422ee  __ubsan::ScopedReport::~ScopedReport() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/ubsan/ubsan_diag.cc:73
    @           0x549608  HandleDynamicTypeCacheMiss() at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/ubsan/ubsan_handlers_cxx.cc:81
    @           0x549a43  __ubsan_handle_dynamic_type_cache_miss_abort at /data/8/awong/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/ubsan/ubsan_handlers_cxx.cc:93
    @     0x7fa3f0086643  _ZZN4kudu6client8internal20AsyncLeaderMasterRpcINS_6master23GetTableSchemaRequestPBENS3_24GetTableSchemaResponsePBEE27RetryOrReconnectIfNecessaryEPNS_6StatusEENKUlvE_clEv at ??:0
    @     0x7fa3f0059b8c  _ZN4kudu13ScopedCleanupIZNS_6client8internal20AsyncLeaderMasterRpcINS_6master23GetTableSchemaRequestPBENS4_24GetTableSchemaResponsePBEE27RetryOrReconnectIfNecessaryEPNS_6StatusEEUlvE_ED2Ev at ??:0
    @     0x7fa3f005945c  kudu::client::internal::AsyncLeaderMasterRpc<>::RetryOrReconnectIfNecessary() at ??:0
    @     0x7fa3f0057d0a  kudu::client::internal::AsyncLeaderMasterRpc<>::SendRpcCb() at ??:0
    @     0x7fa3f0085a85  _ZNSt5_BindIFSt7_Mem_fnIMN4kudu6client8internal20AsyncLeaderMasterRpcINS1_6master23GetTableSchemaRequestPBENS5_24GetTableSchemaResponsePBEEEFvRKNS1_6StatusEEEPS8_S9_EE6__callIvJEJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE at ??:0
    #19 0x7fa3ddc84ffc in clone sysdeps/unix/sysv/linux/x86_64/clone.S:111

Upon looking into this and similar failures, it seems like the following
series of events is possible:

T1: allocate an AsyncLeaderMasterRpc on the stack with
    Synchronizer.Wait() as the user-specified callback
T2: the RPC call results in an error that makes it retry via
    RetryOrReconnectIfNecessary(); a retry is scheduled
T3: the retry completes and calls the user-specified callback to begin
    execution on T1
T1: with the RPC call completed, destroys the RPC object
T2: the KLOG messages attempt to access state from the destroyed RPC
    object and hit a crash, race, undefined behavior, etc.
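
To make the interleaving concrete, here is a minimal, hypothetical C++
sketch of the same pattern (FakeRpc, RetryOrReconnect(), and the thread
roles below are illustrative stand-ins, not the actual Kudu classes):
the waiter destroys the stack-allocated RPC as soon as the callback
fires, while the thread that scheduled the retry keeps reading member
state.

    // Hypothetical stand-in for the race described above; not the real
    // Kudu code. Running this under TSAN/ASAN should flag the racy read
    // of 'rpc_name'.
    #include <chrono>
    #include <condition_variable>
    #include <functional>
    #include <iostream>
    #include <mutex>
    #include <string>
    #include <thread>

    struct FakeRpc {
      std::string rpc_name = "GetTableSchema";
      std::function<void()> user_cb;

      // Stand-in for RetryOrReconnectIfNecessary(): it lets the RPC
      // "complete" on another thread, then keeps touching member state.
      void RetryOrReconnect() {
        // "T3": the retry completes and runs the user-specified callback.
        std::thread([cb = user_cb] { cb(); }).detach();

        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        // BUG (the race above): the waiter may have destroyed *this by now.
        std::cout << "retrying " << rpc_name << std::endl;
      }
    };

    int main() {
      std::thread reactor;  // plays the role of T2
      {
        std::mutex m;
        std::condition_variable cv;
        bool done = false;

        // T1: the RPC lives on the stack; the callback stands in for the
        // Synchronizer being satisfied.
        FakeRpc rpc;
        rpc.user_cb = [&] {
          std::lock_guard<std::mutex> l(m);
          done = true;
          cv.notify_one();
        };

        reactor = std::thread([&rpc] { rpc.RetryOrReconnect(); });

        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return done; });
        // T1 destroys 'rpc' at the end of this scope even though the
        // reactor thread may still be inside RetryOrReconnect().
      }
      reactor.join();
      return 0;
    }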

The fix is to make a copy of the RPC's state that was to be logged, so
there's no chance that T1's destruction of the RPC will affect T2. I
tested this by looping a test[1] that repeatedly calls
Client::IsCreateTableInProgress() on a multi-master cluster configured
to have election storms: without this patch, 5/1000 runs yielded TSAN
errors around the logging calls in RetryOrReconnectIfNecessary(); with
it, 1000/1000 runs pass.

[1] https://gist.github.com/andrwng/5d552f75a2e0d671b7ed54dd01892c66
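
For reference, reusing the hypothetical FakeRpc from the sketch above,
this is roughly the pattern the fix follows (again illustrative only,
not the actual master_proxy_rpc.cc change): copy whatever member state
the log statement needs into locals before the retry is handed off, so
nothing dereferences 'this' afterwards.

    // Fixed version of the hypothetical RetryOrReconnect() above: copy
    // the state first, then never touch 'this' again.
    void FakeRpc::RetryOrReconnect() {
      // Copy the members the log message needs while 'this' is still
      // guaranteed to be alive.
      const std::string rpc_name_copy = rpc_name;

      // Hand the RPC off ("schedule the retry"); after this point the
      // waiter may destroy 'this' at any time, so no members may be
      // accessed.
      std::thread([cb = user_cb] { cb(); }).detach();

      // Safe: only the local copy is read.
      std::cout << "retrying " << rpc_name_copy << std::endl;
    }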

Change-Id: I8cefd9613018247a1a25d17adedc021e8be166f6
---
M src/kudu/client/master_proxy_rpc.cc
1 file changed, 8 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/12170/6
--
To view, visit http://gerrit.cloudera.org:8080/12170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8cefd9613018247a1a25d17adedc021e8be166f6
Gerrit-Change-Number: 12170
Gerrit-PatchSet: 6
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
