[ https://issues.apache.org/jira/browse/IMPALA-8212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772744#comment-16772744 ]
Michael Ho commented on IMPALA-8212:
------------------------------------

Looking at the stack trace of the crash, it seems that the Kudu code called into Kerberos code that inadvertently modified {{g_krb5_ctx}}. As far as I understand, the assumption is that {{g_krb5_ctx}} is global and shared, and that it should not be modified after initialization. However, the default initialization code {{krb5_init_context(&g_krb5_ctx)}} called by {{kudu::security::InitKrb5Ctx()}} only sets {{g_krb5_ctx->default_realm}} to 0. Upon the first call to {{krb5_parse_name()}}, the Kerberos library will call {{krb5_get_default_realm()}} to get the default realm, as the SASL client we created didn't actually take the Kerberos realm as an argument. Apparently, {{krb5_get_default_realm()}} may modify {{g_krb5_ctx}} and it's not thread safe. As shown in the stack trace and the code below, {{context->default_realm}} is most likely {{NULL}}. So, if multiple negotiation threads get into the same code path of calling {{krb5_get_default_realm()}} at the same time, they may end up stepping on each other and corrupting {{g_krb5_ctx}}, leading to the crash we saw above or to error messages like the following:

{noformat}
0216 14:26:07.459600 (+ 296us) negotiation.cc:304] Negotiation complete: Runtime error: Server connection negotiation failed: server connection from X.X.X.X:37070: could not canonicalize krb5 principal: could not parse principal: Configuration file does not specify default realm
{noformat}

[~tlipcon] kindly pointed out that someone reported a similar issue in Kerberos upstream in the past (http://krbdev.mit.edu/rt/Ticket/Display.html?id=2855).
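To make the race described above concrete, here is a minimal sketch of one possible mitigation: serializing every call that may lazily fill in the shared context. It does not use libkrb5 and is not the fix that was applied; {{FakeCtx}}, {{FakeGetDefaultRealm()}}, {{g_fake_ctx_lock}}, {{CanonicalizeWithLock()}}, {{RunNegotiationThreads()}} and the realm {{EXAMPLE.COM}} are all hypothetical names that only model the shape of the lazy initialization.

```cpp
#include <cerrno>
#include <cstdlib>
#include <cstring>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the shared Kerberos context: default_realm
// starts out NULL and is filled in lazily on first use.
struct FakeCtx {
  char* default_realm = nullptr;
  int lookups = 0;  // number of times the lazy lookup actually ran
};

// Portable strdup() replacement so the sketch needs only the standard library.
char* DupString(const char* s) {
  size_t n = std::strlen(s) + 1;
  char* p = static_cast<char*>(std::malloc(n));
  if (p != nullptr) std::memcpy(p, s, n);
  return p;
}

// Models an unsynchronized lazy initialization: it writes to the shared
// context the first time it is called, so it must not run concurrently.
int FakeGetDefaultRealm(FakeCtx* ctx, char** realm_out) {
  *realm_out = nullptr;
  if (ctx->default_realm == nullptr) {
    ctx->lookups++;
    ctx->default_realm = DupString("EXAMPLE.COM");  // made-up realm
    if (ctx->default_realm == nullptr) return ENOMEM;
  }
  *realm_out = DupString(ctx->default_realm);
  return (*realm_out == nullptr) ? ENOMEM : 0;
}

// Assumed mitigation: a process-wide lock held around any call that can
// touch the shared context.
std::mutex g_fake_ctx_lock;

std::string CanonicalizeWithLock(FakeCtx* ctx, const std::string& user) {
  char* realm = nullptr;
  {
    std::lock_guard<std::mutex> guard(g_fake_ctx_lock);
    if (FakeGetDefaultRealm(ctx, &realm) != 0) return "";
  }
  std::string out = user + "@" + realm;
  std::free(realm);
  return out;
}

// Simulates n negotiation threads hitting the code path at the same time.
std::vector<std::string> RunNegotiationThreads(FakeCtx* ctx, int n) {
  std::vector<std::string> results(n);
  std::vector<std::thread> threads;
  for (int i = 0; i < n; ++i) {
    threads.emplace_back([ctx, &results, i] {
      results[i] = CanonicalizeWithLock(ctx, "impala");
    });
  }
  for (auto& t : threads) t.join();
  return results;
}
```

Calling {{RunNegotiationThreads(&ctx, 8)}} on a fresh {{FakeCtx}} should run the lazy lookup exactly once, because the lock admits one thread at a time. Another option, also only an assumption here, would be to resolve the default realm once during initialization, before any negotiation thread starts, so that later calls only read the context.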
{noformat}
krb5_error_code KRB5_CALLCONV
krb5_get_default_realm(krb5_context context, char **realm_out)
{
    krb5_error_code ret;

    *realm_out = NULL;
    if (context == NULL || context->magic != KV5M_CONTEXT)
        return KV5M_CONTEXT;

    if (context->default_realm == NULL) {
        ret = get_default_realm(context, &context->default_realm);  // <<<----- non-thread safe call
        if (ret)
            return ret;
    }
    *realm_out = strdup(context->default_realm);
    return (*realm_out == NULL) ? ENOMEM : 0;
}
{noformat}

Stack trace showing the call path into {{krb5_get_default_realm()}}:

{noformat}
#30 <signal handler called>
#31 0x00000000048d0a53 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) ()
#32 0x00000000048d0aec in tcmalloc::ThreadCache::ListTooLong(tcmalloc::ThreadCache::FreeList*, unsigned long) ()
#33 0x0000000004a0b4c0 in tc_free ()
#34 0x00007fb03f051720 in profile_iterator_free () from sysroot/lib64/libkrb5.so.3
#35 0x00007fb03f0519a4 in profile_get_value () from sysroot/lib64/libkrb5.so.3
#36 0x00007fb03f051a18 in profile_get_string () from sysroot/lib64/libkrb5.so.3
#37 0x00007fb03f044dde in profile_default_realm () from sysroot/lib64/libkrb5.so.3
#38 0x00007fb03f044509 in krb5_get_default_realm () from sysroot/lib64/libkrb5.so.3
#39 0x00007fb03f0245e8 in krb5_parse_name_flags () from sysroot/lib64/libkrb5.so.3
#40 0x0000000001ff7bbf in kudu::security::CanonicalizeKrb5Principal(std::string*) ()
#41 0x00000000026ee4df in kudu::rpc::ServerNegotiation::AuthenticateBySasl(kudu::faststring*) ()
#42 0x00000000026ea929 in kudu::rpc::ServerNegotiation::Negotiate() ()
#43 0x000000000271035b in kudu::rpc::DoServerNegotiation(kudu::rpc::Connection*, kudu::TriStateFlag, kudu::TriStateFlag, kudu::MonoTime const&) ()
#44 0x000000000271070d in kudu::rpc::Negotiation::RunNegotiation(scoped_refptr<kudu::rpc::Connection> const&, kudu::TriStateFlag, kudu::TriStateFlag, kudu::MonoTime) ()
{noformat}

> Crash during startup in kudu::security::CanonicalizeKrb5Principal()
> -------------------------------------------------------------------
>
> Key: IMPALA-8212
> URL: https://issues.apache.org/jira/browse/IMPALA-8212
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 3.2.0
> Environment: CentOS Linux release 7.4.1708 (Core)
> Linux vc0512.halxg.cloudera.com 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> Reporter: Tim Armstrong
> Assignee: Michael Ho
> Priority: Blocker
> Labels: crash
> Attachments: gdb-core-60055.txt, gdb.txt, hs_err_pid60055.log,
> hs_err_pid65365.log,
> impalad.vc0512.halxg.cloudera.com.impala.log.INFO.20190218-140034.65365,
> impalad.vc0513.halxg.cloudera.com.impala.log.INFO.20190216-142536.60055
>
>
> I saw this crash twice while working on the stress test. It *seems* to happen
> when the stress infrastructure switches the service to a debug build,
> restarts the service, then starts running queries. I haven't seen it happen
> once the service is up and running for a while.
> {noformat}
> #0 0x00007fb03e1fa1f7 in raise () from sysroot/lib64/libc.so.6
> #1 0x00007fb03e1fb8e8 in abort () from sysroot/lib64/libc.so.6
> #2 0x00007fb041159185 in os::abort(bool) () from sysroot/usr/java/jdk1.8.0_141-cloudera/jre/lib/amd64/server/libjvm.so
> #3 0x00007fb0412fb593 in VMError::report_and_die() () from sysroot/usr/java/jdk1.8.0_141-cloudera/jre/lib/amd64/server/libjvm.so
> #4 0x00007fb04115e68f in JVM_handle_linux_signal () from sysroot/usr/java/jdk1.8.0_141-cloudera/jre/lib/amd64/server/libjvm.so
> #5 0x00007fb041154be3 in signalHandler(int, siginfo*, void*) () from sysroot/usr/java/jdk1.8.0_141-cloudera/jre/lib/amd64/server/libjvm.so
> #6 <signal handler called>
> #7 0x00000000048d0a53 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) ()
> #8 0x00000000048d0aec in tcmalloc::ThreadCache::ListTooLong(tcmalloc::ThreadCache::FreeList*, unsigned long) ()
> #9 0x0000000004a0b4c0 in tc_free ()
> #10 0x00007fb040d32933 in ElfDecoder::demangle(char const*, char*, int) () from sysroot/usr/java/jdk1.8.0_141-cloudera/jre/lib/amd64/server/libjvm.so
> #11 0x00007fb040d3222a in Decoder::demangle(char const*, char*, int) () from sysroot/usr/java/jdk1.8.0_141-cloudera/jre/lib/amd64/server/libjvm.so
> #12 0x00007fb04115695d in os::dll_address_to_function_name(unsigned char*, char*, int, int*) () from sysroot/usr/java/jdk1.8.0_141-cloudera/jre/lib/amd64/server/libjvm.so
> #13 0x00007fb040dc0222 in frame::print_C_frame(outputStream*, char*, int, unsigned char*) () from sysroot/usr/java/jdk1.8.0_141-cloudera/jre/lib/amd64/server/libjvm.so
> #14 0x00007fb040d2e925 in print_native_stack(outputStream*, frame, Thread*, char*, int) () from sysroot/usr/java/jdk1.8.0_141-cloudera/jre/lib/amd64/server/libjvm.so
> #15 0x00007fb0412f9cc8 in VMError::report(outputStream*) () from sysroot/usr/java/jdk1.8.0_141-cloudera/jre/lib/amd64/server/libjvm.so
> #16 0x00007fb0412fb18a in VMError::report_and_die() () from sysroot/usr/java/jdk1.8.0_141-cloudera/jre/lib/amd64/server/libjvm.so
> #17 0x00007fb04115e68f in JVM_handle_linux_signal () from sysroot/usr/java/jdk1.8.0_141-cloudera/jre/lib/amd64/server/libjvm.so
> #18 0x00007fb041154be3 in signalHandler(int, siginfo*, void*) () from sysroot/usr/java/jdk1.8.0_141-cloudera/jre/lib/amd64/server/libjvm.so
> #19 <signal handler called>
> #20 0x00000000048d0a53 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) ()
> #21 0x00000000048d0aec in tcmalloc::ThreadCache::ListTooLong(tcmalloc::ThreadCache::FreeList*, unsigned long) ()
> #22 0x0000000004a0b4c0 in tc_free ()
> #23 0x00007fb03e5915dd in pthread_attr_destroy () from sysroot/lib64/libpthread.so.0
> #24 0x00007fb04115e49f in current_stack_region(unsigned char**, unsigned long*) () from sysroot/usr/java/jdk1.8.0_141-cloudera/jre/lib/amd64/server/libjvm.so
> #25 0x00007fb04115e535 in os::current_stack_base() () from sysroot/usr/java/jdk1.8.0_141-cloudera/jre/lib/amd64/server/libjvm.so
> #26 0x00007fb0412faeb4 in VMError::report(outputStream*) () from sysroot/usr/java/jdk1.8.0_141-cloudera/jre/lib/amd64/server/libjvm.so
> #27 0x00007fb0412fb18a in VMError::report_and_die() () from sysroot/usr/java/jdk1.8.0_141-cloudera/jre/lib/amd64/server/libjvm.so
> #28 0x00007fb04115e68f in JVM_handle_linux_signal () from sysroot/usr/java/jdk1.8.0_141-cloudera/jre/lib/amd64/server/libjvm.so
> #29 0x00007fb041154be3 in signalHandler(int, siginfo*, void*) () from sysroot/usr/java/jdk1.8.0_141-cloudera/jre/lib/amd64/server/libjvm.so
> #30 <signal handler called>
> #31 0x00000000048d0a53 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) ()
> #32 0x00000000048d0aec in tcmalloc::ThreadCache::ListTooLong(tcmalloc::ThreadCache::FreeList*, unsigned long) ()
> #33 0x0000000004a0b4c0 in tc_free ()
> #34 0x00007fb03f051720 in profile_iterator_free () from sysroot/lib64/libkrb5.so.3
> #35 0x00007fb03f0519a4 in profile_get_value () from sysroot/lib64/libkrb5.so.3
> #36 0x00007fb03f051a18 in profile_get_string () from sysroot/lib64/libkrb5.so.3
> #37 0x00007fb03f044dde in profile_default_realm () from sysroot/lib64/libkrb5.so.3
> #38 0x00007fb03f044509 in krb5_get_default_realm () from sysroot/lib64/libkrb5.so.3
> #39 0x00007fb03f0245e8 in krb5_parse_name_flags () from sysroot/lib64/libkrb5.so.3
> #40 0x0000000001ff7bbf in kudu::security::CanonicalizeKrb5Principal(std::string*) ()
> #41 0x00000000026ee4df in kudu::rpc::ServerNegotiation::AuthenticateBySasl(kudu::faststring*) ()
> #42 0x00000000026ea929 in kudu::rpc::ServerNegotiation::Negotiate() ()
> #43 0x000000000271035b in kudu::rpc::DoServerNegotiation(kudu::rpc::Connection*, kudu::TriStateFlag, kudu::TriStateFlag, kudu::MonoTime const&) ()
> #44 0x000000000271070d in kudu::rpc::Negotiation::RunNegotiation(scoped_refptr<kudu::rpc::Connection> const&, kudu::TriStateFlag, kudu::TriStateFlag, kudu::MonoTime) ()
> #45 0x00000000026ca8ab in kudu::internal::RunnableAdapter<void (*)(scoped_refptr<kudu::rpc::Connection> const&, kudu::TriStateFlag, kudu::TriStateFlag, kudu::MonoTime)>::Run(scoped_refptr<kudu::rpc::Connection> const&, kudu::TriStateFlag const&, kudu::TriStateFlag const&, kudu::MonoTime const&) ()
> #46 0x00000000026c9bf4 in kudu::internal::InvokeHelper<false, void, kudu::internal::RunnableAdapter<void (*)(scoped_refptr<kudu::rpc::Connection> const&, kudu::TriStateFlag, kudu::TriStateFlag, kudu::MonoTime)>, void (kudu::rpc::Connection*, kudu::TriStateFlag const&, kudu::TriStateFlag const&, kudu::MonoTime const&)>::MakeItSo(kudu::internal::RunnableAdapter<void (*)(scoped_refptr<kudu::rpc::Connection> const&, kudu::TriStateFlag, kudu::TriStateFlag, kudu::MonoTime)>, kudu::rpc::Connection*, kudu::TriStateFlag const&, kudu::TriStateFlag const&, kudu::MonoTime const&) ()
> #47 0x00000000026c8ad3 in kudu::internal::Invoker<4, kudu::internal::BindState<kudu::internal::RunnableAdapter<void (*)(scoped_refptr<kudu::rpc::Connection> const&, kudu::TriStateFlag, kudu::TriStateFlag, kudu::MonoTime)>, void (scoped_refptr<kudu::rpc::Connection> const&, kudu::TriStateFlag, kudu::TriStateFlag, kudu::MonoTime), void (scoped_refptr<kudu::rpc::Connection>, kudu::TriStateFlag, kudu::TriStateFlag, kudu::MonoTime)>, void (scoped_refptr<kudu::rpc::Connection> const&, kudu::TriStateFlag, kudu::TriStateFlag, kudu::MonoTime)>::Run(kudu::internal::BindStateBase*) ()
> #48 0x0000000001dae84c in kudu::Callback<void ()>::Run() const ()
> #49 0x000000000295a66a in kudu::ClosureRunnable::Run() ()
> #50 0x00000000029595fd in kudu::ThreadPool::DispatchThread() ()
> #51 0x00000000029650d5 in boost::_mfi::mf0<void, kudu::ThreadPool>::operator()(kudu::ThreadPool*) const ()
> #52 0x0000000002964602 in void boost::_bi::list1<boost::_bi::value<kudu::ThreadPool*> >::operator()<boost::_mfi::mf0<void, kudu::ThreadPool>, boost::_bi::list0>(boost::_bi::type<void>, boost::_mfi::mf0<void, kudu::ThreadPool>&, boost::_bi::list0&, int) ()
> #53 0x0000000002963a05 in boost::_bi::bind_t<void, boost::_mfi::mf0<void, kudu::ThreadPool>, boost::_bi::list1<boost::_bi::value<kudu::ThreadPool*> > >::operator()() ()
> #54 0x0000000002962b61 in boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, boost::_mfi::mf0<void, kudu::ThreadPool>, boost::_bi::list1<boost::_bi::value<kudu::ThreadPool*> > >, void>::invoke(boost::detail::function::function_buffer&) ()
> #55 0x0000000001d76514 in boost::function0<void>::operator()() const ()
> #56 0x0000000001d72da2 in kudu::Thread::SuperviseThread(void*) ()
> #57 0x00007fb03e58fe25 in start_thread () from sysroot/lib64/libpthread.so.0
> #58 0x00007fb03e2bd34d in clone () from sysroot/lib64/libc.so.6
> {noformat}
> This was a downstream Cloudera build, but the code is the same as this
> upstream commit:
> {noformat}
> Author: Andrew Sherman <asher...@cloudera.com>
> Date: Tue Feb 12 16:17:13 2019 -0800
> IMPALA-8194: wait longer to detect JVM pause in TestPauseMonitor.
> {noformat}
> cc [~twm378]

--
This message was sent by Atlassian JIRA (v7.6.3#76005)