Ping!

On 11/7/19 2:01 PM, Chris Plummer wrote:
Hi,

Please review the following fix for JDK-8231635:

https://bugs.openjdk.java.net/browse/JDK-8231635
http://cr.openjdk.java.net/~cjplummer/8231635/webrev.00/

I've tried to explain below to the best of my ability what's is going on, but keep in mind that I basically had no background in this area before looking into this CR, so this is all new to me. Please feel free to chime in with corrections to my explanation, or any additional insight that might help to further understanding of this code.

When doing a thread stack dump, SA has to figure out the SP for the current frame when it may not in fact be stored anywhere. So it goes through a series of guesses, starting with the current value of SP. See AMD64CurrentFrameGuess.run():

    Address sp  = context.getRegisterAsAddress(AMD64ThreadContext.RSP);

There are a number of checks done to see if this is the SP for the actual current frame, one of the checks being (and kind of a last resort) to follow the frame links and see if they eventually lead to the first entry frame:

            while (frame != null) {
              if (frame.isEntryFrame() && frame.entryFrameIsFirst()) {
                 ...
                 return true;
              }
              frame = frame.sender(map);
            }

If this fails, there is an outer loop to try the next address:

        for (long offset = 0;
             offset < regionInBytesToSearch;
             offset += vm.getAddressSize()) {

Note that offset is added to the initial SP value that was fetched from RSP. This approach is fraught with danger, because SP could be incorrect, and you can easily follow a bad frame link to an invalid address. So the body of this loop is in a try block that catches all Exceptions, and simply retries with the next offset if one is caught. Exceptions could be ones like UnalignedAddressException or UnmappedAddressException.

The bug in question turns up with the following harmless looking line:

              frame = frame.sender(map);

This is fine if you know that "frame" is valid, but what if it is not (which is very commonly the case). The frame values (SP, FP, and PC) in the returned frame could be just about anything, including being the same as the previous frame. This is what will happen if the SP stored in "frame" is the same as the SP that was used to initialize "frame" in the first place. This can certainly happen when SP is not valid to start with, and is indeed what caused this bug. The end result is the inner while loop gets stuck in an infinite loop traversing the same frame. So the fix is to add a check for this to make sure to break out of the while loop if this happens. Initially I did this with an Address.equal() call, and that seemed to fix the problem, but then I realized it would be possible to traverse through one or more sender frames and eventually end up returning to a previously visited frame, thus still an infinite loop. So I decided on checking for Address.lessThanOrEqual() instead since the send frame's SP should always be greater than the current frame's (referred to as oldFrame) SP. As long as we always move in one direction (towards a higher frame address), you can't have an infinite loop in this code.

I applied this fix to x86. Although not tested, it is built (all platform support is always built with SA). The x86 and amd64 versions are identical except for x86/amd64 references, so I thought it best to go ahead and do the update to x86. I did not touch ppc, but would be willing to update if someone passes along a fix that is tested.

One final bit of clarification. The bug synopsis mentions getting stuck in BasicTypeDataBase.findDynamicTypeForAddress(). This turns out to not actually be the case, but every stack trace I initially looked when I filed this CR was showing the thread being in this frame and at the same line number. This appears to be the next available safepoint where the thread can be suspended for stack dumping. When debugging this some more and adding a lot of println() calls in a lot of different locations, I started to see different frames in the stacktrace, presumably because the println() calls where adding additional safepoints.

thanks,

Chris



Reply via email to