Hi Chris,
This seems to be a good fix to have in any case.
This check and bail out is right thing to do and should not break anything.
I understand, this also fixes the test failures.
I only had some experience a long time ago with the support of pstack
and DTrace jstack action implementation which also does such SP
recovering because the ebp can be used by JIT compiler as a general
purpose register. There is no such a problem on sparc.
Thanks,
Serguei
On 11/7/19 14:01, Chris Plummer wrote:
Hi,
Please review the following fix for JDK-8231635:
https://bugs.openjdk.java.net/browse/JDK-8231635
http://cr.openjdk.java.net/~cjplummer/8231635/webrev.00/
I've tried to explain below to the best of my ability what's is going
on, but keep in mind that I basically had no background in this area
before looking into this CR, so this is all new to me. Please feel
free to chime in with corrections to my explanation, or any additional
insight that might help to further understanding of this code.
When doing a thread stack dump, SA has to figure out the SP for the
current frame when it may not in fact be stored anywhere. So it goes
through a series of guesses, starting with the current value of SP.
See AMD64CurrentFrameGuess.run():
Address sp = context.getRegisterAsAddress(AMD64ThreadContext.RSP);
There are a number of checks done to see if this is the SP for the
actual current frame, one of the checks being (and kind of a last
resort) to follow the frame links and see if they eventually lead to
the first entry frame:
while (frame != null) {
if (frame.isEntryFrame() && frame.entryFrameIsFirst()) {
...
return true;
}
frame = frame.sender(map);
}
If this fails, there is an outer loop to try the next address:
for (long offset = 0;
offset < regionInBytesToSearch;
offset += vm.getAddressSize()) {
Note that offset is added to the initial SP value that was fetched
from RSP. This approach is fraught with danger, because SP could be
incorrect, and you can easily follow a bad frame link to an invalid
address. So the body of this loop is in a try block that catches all
Exceptions, and simply retries with the next offset if one is caught.
Exceptions could be ones like UnalignedAddressException or
UnmappedAddressException.
The bug in question turns up with the following harmless looking line:
frame = frame.sender(map);
This is fine if you know that "frame" is valid, but what if it is not
(which is very commonly the case). The frame values (SP, FP, and PC)
in the returned frame could be just about anything, including being
the same as the previous frame. This is what will happen if the SP
stored in "frame" is the same as the SP that was used to initialize
"frame" in the first place. This can certainly happen when SP is not
valid to start with, and is indeed what caused this bug. The end
result is the inner while loop gets stuck in an infinite loop
traversing the same frame. So the fix is to add a check for this to
make sure to break out of the while loop if this happens. Initially I
did this with an Address.equal() call, and that seemed to fix the
problem, but then I realized it would be possible to traverse through
one or more sender frames and eventually end up returning to a
previously visited frame, thus still an infinite loop. So I decided on
checking for Address.lessThanOrEqual() instead since the send frame's
SP should always be greater than the current frame's (referred to as
oldFrame) SP. As long as we always move in one direction (towards a
higher frame address), you can't have an infinite loop in this code.
I applied this fix to x86. Although not tested, it is built (all
platform support is always built with SA). The x86 and amd64 versions
are identical except for x86/amd64 references, so I thought it best to
go ahead and do the update to x86. I did not touch ppc, but would be
willing to update if someone passes along a fix that is tested.
One final bit of clarification. The bug synopsis mentions getting
stuck in BasicTypeDataBase.findDynamicTypeForAddress(). This turns out
to not actually be the case, but every stack trace I initially looked
when I filed this CR was showing the thread being in this frame and at
the same line number. This appears to be the next available safepoint
where the thread can be suspended for stack dumping. When debugging
this some more and adding a lot of println() calls in a lot of
different locations, I started to see different frames in the
stacktrace, presumably because the println() calls where adding
additional safepoints.
thanks,
Chris