Re: RFR: 8242181: [Linux] Show source information when printing native stack traces in hs_err files [v5]

Thomas Stuefe Mon, 28 Mar 2022 06:20:03 -0700

On Mon, 28 Feb 2022 16:22:25 GMT, Christian Hagedorn <chaged...@openjdk.org> 
wrote:


>> When printing the native stack trace on Linux (mostly done for hs_err 
>> files), it only prints the method with its parameters and a relative offset 
>> in the method:
>> 
>> Stack: [0x00007f6e01739000,0x00007f6e0183a000],  sp=0x00007f6e01838110,  
>> free space=1020k
>> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
>> code)
>> V  [libjvm.so+0x620d86]  Compilation::~Compilation()+0x64
>> V  [libjvm.so+0x624b92]  Compiler::compile_method(ciEnv*, ciMethod*, int, 
>> bool, DirectiveSet*)+0xec
>> V  [libjvm.so+0x8303ef]  
>> CompileBroker::invoke_compiler_on_method(CompileTask*)+0x899
>> V  [libjvm.so+0x82f067]  CompileBroker::compiler_thread_loop()+0x3df
>> V  [libjvm.so+0x84f0d1]  CompilerThread::thread_entry(JavaThread*, 
>> JavaThread*)+0x69
>> V  [libjvm.so+0x1209329]  JavaThread::thread_main_inner()+0x15d
>> V  [libjvm.so+0x12091c9]  JavaThread::run()+0x167
>> V  [libjvm.so+0x1206ada]  Thread::call_run()+0x180
>> V  [libjvm.so+0x1012e55]  thread_native_entry(Thread*)+0x18f
>> 
>> This makes it sometimes difficult to see where exactly the methods were 
>> called from and sometimes almost impossible when there are multiple 
>> invocations of the same method within one method.
>> 
>> This patch improves this by providing source information (filename + line 
>> number) to the native stack traces on Linux similar to what's already done 
>> on Windows (see 
>> [JDK-8185712](https://bugs.openjdk.java.net/browse/JDK-8185712)):
>> 
>> Stack: [0x00007f34fca18000,0x00007f34fcb19000],  sp=0x00007f34fcb17110,  
>> free space=1020k
>> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
>> code)
>> V  [libjvm.so+0x620d86]  Compilation::~Compilation()+0x64  
>> (c1_Compilation.cpp:607)
>> V  [libjvm.so+0x624b92]  Compiler::compile_method(ciEnv*, ciMethod*, int, 
>> bool, DirectiveSet*)+0xec  (c1_Compiler.cpp:250)
>> V  [libjvm.so+0x8303ef]  
>> CompileBroker::invoke_compiler_on_method(CompileTask*)+0x899  
>> (compileBroker.cpp:2291)
>> V  [libjvm.so+0x82f067]  CompileBroker::compiler_thread_loop()+0x3df  
>> (compileBroker.cpp:1966)
>> V  [libjvm.so+0x84f0d1]  CompilerThread::thread_entry(JavaThread*, 
>> JavaThread*)+0x69  (compilerThread.cpp:59)
>> V  [libjvm.so+0x1209329]  JavaThread::thread_main_inner()+0x15d  
>> (thread.cpp:1297)
>> V  [libjvm.so+0x12091c9]  JavaThread::run()+0x167  (thread.cpp:1280)
>> V  [libjvm.so+0x1206ada]  Thread::call_run()+0x180  (thread.cpp:358)
>> V  [libjvm.so+0x1012e55]  thread_native_entry(Thread*)+0x18f  
>> (os_linux.cpp:705)
>> 
>> For Linux, we need to parse the debug symbols which are generated by GCC in 
>> DWARF - a standardized debugging format. This patch adds support for DWARF 
>> 4, the default of GCC 10.x, for 32 and 64 bit architectures (tested with 
>> x86_32, x86_64 and AArch64). DWARF 5 is not supported as it was still 
>> experimental and not generated for HotSpot. However, newer GCC version may 
>> soon generate DWARF 5 by default in which case this parser either needs to 
>> be extended or the build of HotSpot configured to only emit DWARF 4. 
>> 
>> The code follows the parsing steps described in the official DWARF 4 spec: 
>> https://dwarfstd.org/doc/DWARF4.pdf
>> I added references to the corresponding sections throughout the code. 
>> However, I tried to explain the steps from the DWARF spec directly in the 
>> code (method names, comments etc.). This allows to follow the code without 
>> the need to actually deep dive into the spec. 
>> 
>> The comments at the `Dwarf` class in the `elf.hpp` file explain in more 
>> detail how a DWARF file is structured and how the parsing algorithm works to 
>> get to the filename and line number information. There are more class 
>> comments throughout the `elf.hpp` file about how different DWARF sections 
>> are structured and how the parsing algorithm needs to fetch the required 
>> information. Therefore, I will not repeat the exact workings of the 
>> algorithm here but refer to the code comments. I've tried to add as much 
>> information as possible to improve the readability.
>> 
>> Generally, I've tried to stay away from adding any assertions as this code 
>> is almost always executed when already processing a VM error. Instead, the 
>> DWARF parser aims to just exit gracefully and possibly omit source 
>> information for a stack frame instead of risking to stop writing the hs_err 
>> file when an assertion would have failed. To debug failures, `-Xlog:dwarf` 
>> can be used with `info`, `debug` or `trace` which provides logging messages 
>> throughout parsing. 
>> 
>> **Testing:**
>> Apart from manual testing, I've added two kinds of tests:
>> - A JTreg test: Spawns new VMs to let them crash in various ways. The test 
>> reads the created hs_err files to check if the DWARF parsing could correctly 
>> find the filename and line number. For normal HotSpot files, I could not 
>> check against hardcoded filenames and line numbers as they are subject to 
>> change (especially line number can quickly become different). I therefore 
>> just added some sanity checks in the form of "found a non-empty file" and 
>> "found a non-zero line number". On top of that, I added tests that let the 
>> VM crash in custom C files (which will not change). This enables an 
>> additional verification of hardcoded filenames and line numbers.
>> - Gtests: Directly calling the `get_source()` method which initiates DWARF 
>> parsing. Tested some special cases, for example, having a buffer that is not 
>> big enough to store the filename.
>> 
>> On top of that, there are also existing JTreg tests that call 
>> `-XX:NativeMemoryTracking=detail` which will print a native stack trace with 
>> the new source information. These tests were also run as part of the 
>> standard tier testing and can be considered as sanity tests for this 
>> implementation.
>> 
>> To make tests work in our infrastructure or if some other setups want to 
>> have debug symbols at different locations, I've added support for an 
>> additional  `_JVM_DWARF_PATH` environment variable. This variable can 
>> specify a path from which the DWARF symbol file should be read by the parser 
>> if the default locations do not contain debug symbols (required some `make` 
>> changes). This is similar to what's done on Windows with `_NT_SYMBOL_PATH`. 
>> The JTreg test, however, also works if there are no symbols available. In 
>> that case, the test just skips all the assertion checks for the filename and 
>> line number.
>> 
>> I haven't run any specific performance testing as this new code is mainly 
>> executed when an error will exit the VM and only if symbol files are 
>> available (which is normally not the case when using Java release builds as 
>> a user).
>> 
>> Special thanks to @tschatzl for giving me some pointers to start based on 
>> his knowledge from a DWARF 2 parser he once wrote in Pascal and for 
>> discussing approaches on how to retrieve the source information and to 
>> @erikj79 for providing help for the changes required for `make`!
>>  
>> Thanks,
>> Christian
>
> Christian Hagedorn has updated the pull request with a new target base due to 
> a merge or a rebase. The pull request now contains 54 commits:
> 
>  - Updating some comments
>  - Cleanup loading dwarf file and add summary
>  - Review comments of first pass by Thomas except dwarf file loading
>  - Merge branch 'master' into JDK-8242181
>  - Make dwarf tag NOT_PRODUCT
>  - Change log_* to log_develop_* and log_warning to log_develop_info
>  - Update test/hotspot/jtreg/runtime/ErrorHandling/TestDwarf.java
>    
>    Co-authored-by: Erik Joelsson <37597443+erik...@users.noreply.github.com>
>  - Update test/hotspot/jtreg/runtime/ErrorHandling/TestDwarf.java
>    
>    Co-authored-by: Erik Joelsson <37597443+erik...@users.noreply.github.com>
>  - Better formatting of trace output
>  - some code move and more cleanups
>  - ... and 44 more: 
> https://git.openjdk.java.net/jdk/compare/efd3967b...5bea4841

Hi Christian,

this is impressive work. It's a big change, and I had a look at part of it. 
I'll continue tomorrow.

In general, I'm concerned with the use of both UL and ResourceArea in this 
code. I know the use of UL has been discussed, but still. Use of RA will 
prevent us from getting useful callstacks if we crash and Thread::current is 
NULL or invalid. I'd feel better if we were to consistently rely on an outside 
scratch buffer (like we usually do in error reporting). Even raw ::malloc would 
be better IMHO.

Another concern was safety, since this is a potential attack vector with 
manipulated Dwarf files, if someone manages to provove a crash. Maybe far 
fetched, but still. Would be good to get SonarCloud readings for this code e.g.

More remarks inline.

Cheers, Thomas

src/hotspot/os/windows/decoder_windows.cpp line 39:

> 37: 
> 38: bool Decoder::get_source_info(address pc, char* filename, size_t 
> filename_len, int* line, bool is_pc_after_call) {
> 39:   return SymbolEngine::get_source_info(pc, filename, filename_len, line);

Are all these renaming changes on Windows necessary? Would be easier to review 
without, and easier to backport. That could be done in a separate RFE if needed.

src/hotspot/share/utilities/decoder.hpp line 120:

> 118:   // file name. File name will be silently truncated if output buffer is 
> too small.
> 119:   // If is_pc_after_call is true, then pc is treated as pointing to the 
> next instruction
> 120:   // after a call. The source information for the call instruction is 
> fetched in that case.

I tried to understand how this is used. In NMT, you set it depending on whether 
the PC belongs to the lowest frame. But that frame not really the lowest frame 
-  `NativeCallStack()` -> `os::get_native_stack` skips the first n frames. I 
don't understand what makes the first non-skipped frame different from the 
others.

src/hotspot/share/utilities/decoder_elf.cpp line 59:

> 57:   if (filename == nullptr || filename_len <= 0 || line == nullptr) {
> 58:     return false;
> 59:   }

I'd just assert here.

src/hotspot/share/utilities/elfFile.cpp line 119:

> 117:     return;
> 118:   }
> 119:   strcpy(_filepath, filepath);

This whole section could be shortened by using os::strdup()

src/hotspot/share/utilities/elfFile.cpp line 132:

> 130:   if (_filepath != NULL) {
> 131:     os::free((void*)_filepath);
> 132:   }

Since you removed the NULL checks in similar places, you could remove it here 
too

src/hotspot/share/utilities/elfFile.cpp line 136:

> 134:   delete _shdr_string_table;
> 135:   _shdr_string_table = nullptr;
> 136:   delete _next;

Recursive delete; may run out of stack, especially if this runs in error 
reporting. Who knows how much stack we have. Before your patch this may have 
been optimized away with TCO, but not anymore.

src/hotspot/share/utilities/elfFile.cpp line 289:

> 287:   return sect_index;
> 288: }
> 289: 

This was introduced by Volker as part of JDK-8019929. It is still used in PPC 
coding (see above). Any reason you removed this? You may get failing builds on 
PPC.

src/hotspot/share/utilities/elfFile.cpp line 309:

> 307: bool ElfFile::get_source_info(const uint32_t offset_in_library, char* 
> filename, const size_t filename_len, int* line, bool is_pc_after_call) {
> 308:   ResourceMark rm;
> 309:   // (1)

Wheres (2)? :)

src/hotspot/share/utilities/elfFile.cpp line 324:

> 322: 
> 323:   // Store result in filename and line pointer.
> 324:   if (!_dwarf_file->get_filename_and_line_number(offset_in_library, 
> *filename, filename_len, *line, is_pc_after_call)) {

Could we stick with pointer syntax here instead of switching to refs, since 
this is what we do in the other places?

src/hotspot/share/utilities/elfFile.cpp line 350:

> 348: 
> 349:   const size_t dwarf_filepath_len = strlen(dwarf_filename) + 
> strlen(_filepath) + strlen(".debug/")
> 350:                                     + strlen(usr_lib_debug_directory()) 
> + 2;

unrelated to your patch, but I'd just wire up usr_lib_debug_directory as a 
constant.

src/hotspot/share/utilities/elfFile.cpp line 357:

> 355:   }
> 356: 
> 357:   DwarfFilePath dwarf_file_path(dwarf_filename, dwarf_filepath_buf, 
> dwarf_filepath_len);

I needed a while to comprehend this. So, `dwarf_filename` is actually not just 
the name but the whole .debuginfo section. Our intent is to read the whole 
section - name  + padding + CRC - and pass it into DwarfFilePath, which 
extracts the file name and the CRC checksum, right?

What makes me itchy is that we have no guarantee and no checks that we actually 
read the whole section, that the file name is actually zero-terminated, as it 
is supposed to be, and that, when accessing the supposed crc, we are still 
within whatever we allocated.

If I understood correctly: I would prefer a better naming for dwarf_filename 
(proposal: "debuginfo_section"). Maybe a different type, since "const char*" is 
misleading. And I would rename `get_dwarf_filename()` similarly, and let it 
return the section size too, and verify that size later when extracting the 
CRC, and also verify that the name is zero terminated. 

E.g. to defend against malicious attacks with manipulated debug infos.

src/hotspot/share/utilities/elfFile.cpp line 378:

> 376:   }
> 377: 
> 378:   char* debug_filename = NEW_RESOURCE_ARRAY(char, shdr.sh_size);

With things read from outside, I would sanity check the size. E.g. give it a 
reasonable max like either file size or, Idk, 512M or so.

src/hotspot/share/utilities/elfFile.cpp line 450:

> 448:   if (buf == nullptr) {
> 449:     return false;
> 450:   }

I'd move this close to and local to where it is used.

Also, you seem to repeat the same pattern a lot "NEW_RESOURCE_ARRAY(n), if 
error return something". I'd factor this out to an utility function or utility 
macro, maybe one where you pass the error return value as macro parameter.

src/hotspot/share/utilities/elfFile.cpp line 452:

> 450:   }
> 451: 
> 452:   ElfStringTable* const table = _shdr_string_table;

You only use this in one place, I'd probably just use _shdr_string_table below.

src/hotspot/share/utilities/elfFile.cpp line 464:

> 462:     if (table->string_at(hdr.sh_name, buf, len)) {
> 463:       if (strncmp(buf, name, len) == 0) {
> 464:         return true;

Would things like this be worth a debug assert? Since this seems to indicate 
either a corrupted dwarf file or something is wrong with our understanding how 
dwarf works.

src/hotspot/share/utilities/elfFile.cpp line 546:

> 544:     // Must be equal, otherwise the file is corrupted.
> 545:     return create_new_dwarf_file(filepath);
> 546:   }

Since you bail out usually on conditions, I'd reverse the logic here too, "if 
CRC invalid then log warning and return false".

src/hotspot/share/utilities/elfFile.hpp line 212:

> 210:     const char* _filename;
> 211:     char* _path;
> 212:     const size_t _path_len;

I'd make an explicit comment that _path_len is supposed to include term. zero. 
I'd also rename it to something like _out and _out_len, would IMHO be clearer 
to the casual reader.

src/hotspot/share/utilities/elfFile.hpp line 217:

> 215:    public:
> 216:     DwarfFilePath(const char* filename, char* buf, size_t buf_len) : 
> _filename(filename), _path(buf), _path_len(buf_len) {
> 217:       size_t offset = (strlen(filename) + 4) >> 2u;

Why not use align()? Would be more readable.

src/hotspot/share/utilities/elfFile.hpp line 234:

> 232: 
> 233:     void set(const char* src) {
> 234:       strncpy(_path, src, _path_len);

Won't zero-terminate if srclen >= _path_len. To be sure, I'd use 
`jio_snprintf(_path, _pathlen, "%s", src);`. Or somesuch.

src/hotspot/share/utilities/elfFile.hpp line 235:

> 233:     void set(const char* src) {
> 234:       strncpy(_path, src, _path_len);
> 235:     }

I may be paranoid, but I would add a canary at the end of the output buffer, 
debug only, and check that in ~DwarfFilePath. Either that or use os::malloc. 
With os::malloc(), you get to use NMT and get overwrite recognition for free 
(we recently reworked that and it works more reliably now, and in release too).

src/hotspot/share/utilities/elfFile.hpp line 243:

> 241:     void set_after_last_slash(const char* src) {
> 242:       char* last_slash = strrchr(_path, '/');
> 243:       strncpy(last_slash + 1, src, _path_len);

This does not guard against overwrites. _path_len is the wrong length to use 
here since its the total length, not the last part. Also strncpy does not 
truncate. Also maybe verify that there is an actual slash.

Proposal:


char* const last_slash = strrchr(_path, '/');
if (last_slash) {
  size_t remaining = _path_len - (last_slash - _path) - 1;
  if (remaining > 0) {
    jio_snprintf(last_slash + 1, remaining, "%s", src);
  }
}

src/hotspot/share/utilities/elfFile.hpp line 247:

> 245: 
> 246:     void append(const char* src) {
> 247:       strncat(_path, src, _path_len);

strncat "Appends the first num characters of source to destination, plus a 
terminating null-character."

So, it should be _path_len - 1, otherwise the terminating zero may overwrite 
the buffer end

-------------

Changes requested by stuefe (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/7126

Re: RFR: 8242181: [Linux] Show source information when printing native stack traces in hs_err files [v5]

Reply via email to