[ 
https://issues.apache.org/jira/browse/KUDU-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000454#comment-17000454
 ] 

Todd Lipcon commented on KUDU-3030:
-----------------------------------

I investigated this a bit. We do already install libunwind, and use that for 
things like '/stacks' and glog. However, it seems like tcmalloc has the 
following behavior:

- the autoconf script tries to check whether frame pointers are omitted by 
default (as they usually are on x86)
- it also checks if libunwind is present, and if so, compiles a libunwind-based 
stack walker
- at runtime, if both unwinders are compiled in, it will prefer libunwind on 
systems where it has detected that frame pointers are omitted.

So, in theory, since we do install libunwind before building tcmalloc in 
thirdparty, we should be selecting libunwind by default. However, we set 
CXXFLAGS to '-fno-omit-frame-pointer' in our thirdparty build, which actually 
affects the {{configure}} script as well. So, when the script checked whether 
frame pointers were omitted by default, it decided that they were _not_ 
omitted, and thus configured tcmalloc to prefer the fp-based unwinder.

A couple of ways we can fix this:
(1) stop compiling tcmalloc with -fno-omit-frame-pointer. This should get it to 
prefer libunwind.
(2) add some capability in tcmalloc's configuration to force it to use 
libunwind even when tcmalloc itself is built with frame pointers.
(3) at runtime, it seems like we could set TCMALLOC_STACKTRACE_METHOD=libunwind 
early at startup, and it would prefer libunwind.

If we find that the libunwind-based unwinder is too slow for the heap 
sampling use case, we could also try to patch tcmalloc's FP unwinder to be 
safer. One approach is to call write() on each address before reading it, 
since write() fails with EFAULT instead of crashing if the address is bad. 
Another approach would be to set a threadlocal while we're in the stack trace 
routine, and if we catch a SEGV with this threadlocal set, we could ignore it 
and abort the stack tracing.
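
The write()-based check could look roughly like the sketch below. This is not 
tcmalloc code: the ReadProbe class and its CanRead method are hypothetical, 
and using a pipe as the write target is an assumption of the sketch (the 
kernel has to actually copy from the buffer for a bad address to be noticed).

```cpp
#include <cerrno>
#include <cstddef>
#include <unistd.h>

// Hypothetical helper: checks whether a small memory range can be read
// without faulting, by asking the kernel to copy it into a pipe.
class ReadProbe {
 public:
  ReadProbe() {
    if (pipe(fds_) != 0) fds_[0] = fds_[1] = -1;
  }
  ~ReadProbe() {
    if (fds_[0] >= 0) {
      close(fds_[0]);
      close(fds_[1]);
    }
  }
  ReadProbe(const ReadProbe&) = delete;
  ReadProbe& operator=(const ReadProbe&) = delete;

  // Returns true if all `len` bytes at `addr` are readable. `len` must be
  // small (at most 64 bytes here) so the pipe never fills up and blocks.
  bool CanRead(const void* addr, size_t len) {
    char sink[64];
    if (fds_[1] < 0 || len == 0 || len > sizeof(sink)) return false;

    ssize_t written;
    do {
      written = write(fds_[1], addr, len);
    } while (written < 0 && errno == EINTR);
    // A bad address makes write() fail (errno == EFAULT) instead of raising
    // SIGSEGV in this process.
    if (written < 0) return false;

    // Drain whatever was written so the pipe stays empty for the next probe.
    ssize_t remaining = written;
    while (remaining > 0) {
      ssize_t got = read(fds_[0], sink, static_cast<size_t>(remaining));
      if (got < 0 && errno == EINTR) continue;
      if (got <= 0) break;
      remaining -= got;
    }
    // A short write means only part of the range was readable.
    return static_cast<size_t>(written) == len;
  }

 private:
  int fds_[2];
};
```

An FP-based unwinder could call something like CanRead(frame, 2 * 
sizeof(void*)) before dereferencing each frame pointer and stop the walk when 
it returns false. The cost is two system calls per frame, which would need to 
be weighed against the cost of libunwind.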


> Crash in tcmalloc stack unwinder
> --------------------------------
>
>                 Key: KUDU-3030
>                 URL: https://issues.apache.org/jira/browse/KUDU-3030
>             Project: Kudu
>          Issue Type: Bug
>          Components: build
>    Affects Versions: 1.11.0
>            Reporter: Todd Lipcon
>            Priority: Critical
>
> We recently saw a crash where the tcmalloc heap profiler was trying to unwind 
> the stack, and ended up accessing invalid memory. The issue here is that 
> tcmalloc is relying on frame pointers for stack unwinding, but this particular 
> stack trace was going through libstdc++, which was installed on the system 
> and doesn't have frame pointers. Usually this works OK, but when we get 
> unlucky, we can crash.


