On 6/24/20 6:53 PM, Yasumasa Suenaga wrote:
On 2020/06/25 10:00, Chris Plummer wrote:
On 6/24/20 5:17 PM, Yasumasa Suenaga wrote:
On 2020/06/25 3:22, Chris Plummer wrote:
On 6/24/20 12:01 AM, Yasumasa Suenaga wrote:
On 2020/06/24 15:32, Chris Plummer wrote:
Hi Yasumasa ,
I think LinuxAMD64CFrame is used for pstack and what I've been
looking at has been jstack, and in particular
AMD64CurrentFrameGuess, which does use "last java frame". I'm not
sure why LinuxAMD64CFrame does not look at "last java frame".
Maybe it should.
I thought both pattern (jstack, mixed stack) for this change.
As you know, mixed jstack (jstack --mixed) attempt to find top of
native stack via LinuxAMD64CFrame, register values are needed for
it (so it depends on ptrace() call). So I guess mixed mode jstack
(jhsdb jstack --mixed) would not show any stacks (cannot find
"last java frame").
Hi Yasumasa,
I should have been more clear on what I meant by jstack and pstack.
For jstack I meant using StackTrace.java, which is what you get by
default with "jhsdb jstack" and also the clhsdb jstack command. For
pstack I meant PStack.java, which is what you get with "jhsdb
jstack --mixed" or the clhsdb pstack command.
So this CR impacts both types of stack traces in that they will get
null registers when the the lower level API fails to get the
register set. For StackTrace.java it will then defer to "last java
frame" if available. For PStack.java it will not, and will always
result in no stack trace. The code of interest is here:
AMD64ThreadContext context = (AMD64ThreadContext)
thread.getContext();
Address pc =
context.getRegisterAsAddress(AMD64ThreadContext.RIP);
if (pc == null) return null;
return LinuxAMD64CFrame.getTopFrame(dbg, pc, context);
So the question is should "last java frame" be used if pc == null.
If so, then getTopFrame() would also need to be modified to use
"last java frame" when fetching RBP.
I don't think so because CFrame is defined as "Models a "C"
programming language frame on the stack" in the javadoc, so it
should have *valid* register values IMHO.
In addition, RIP is needed for Linux AMD64 at least because it would
use DWARF since JDK-8234624.
Hi Yasumasa,
I don't quite understand the "C" frame nomenclature since CFrame is
used for non C frames also. The PStack code roughly does the following:
CFrame f = cdbg.topFrameForThread();
ClosestSymbol sym = f.closestSymbolToPC();
Address pc = f.pc();
if (sym != null) {
... native symbol
} else if (interp.contains(pc)) {
... print interpreter frame
So if the CFrame was filled in with "last java frame" values, it
should allow PStack to print the stack starting with the "last java
frame". Any native frame below that point would be missed.
To use "last java frame" in this case looks good because stack
unwinding is a best effort behavior.
However PStack::run is PC-driven. I want to regard it - in other
words, it should not perform if we cannot get register values even if
"last java frame" is available.
Ok, that sounds reasonable.
thanks,
Chris
Thanks,
Yasumasa
Chris
Thanks,
Yasumasa
thanks,
Chris
Thanks,
Yasumasa
thanks,
Chris
On 6/23/20 11:04 PM, Yasumasa Suenaga wrote:
Hi Chris,
Thanks you for explanation.
Your change looks good (but "last java frame" would not be found
in Linux AMD64 because RSP is NULL - cf. LinuxAMD64CFrame.java)
Thanks,
Yasumasa
On 2020/06/24 12:09, Chris Plummer wrote:
On 6/23/20 6:05 PM, Yasumasa Suenaga wrote:
Hi Chris,
Skillful troubleshooters who use jhsdb will aware this
warnings, and they will take other appropriate methods.
However, I'm not sure it is worth to continue to perform even
if SA cannot get register values.
For example, Linux AMD64 depends on RIP and RSP values to find
top frame.
According to your change, The caller of
getThreadIntegerRegisterSet() has responsible for dealing with
the set of null registers. However X86ThreadContext::data (it
includes raw register values) would still be zero when it
happens.
This is what I intended to have happen. Just end up with a
register set of all nulls. Then when stack walking code gets a
null, it will revert to "last java frame" if available,
otherwise no stack dump is done.
So I think register holder (e.g. X86ThreadContext) should have
tri-state (have registers, fail to get registers, not yet
attempt to get registers).
OTOH it might be over-engineering. What do you think?
Before implementing this I looked at the what would be the
easier approach to get the desired effect of stack walking code
simply failing over to using "last java frame", and decided the
null set of registers was easiest. Other approaches involved
more changes and impacted more files.
thanks,
Chris
Thanks,
Yasumasa
On 2020/06/24 3:16, Chris Plummer wrote:
On 6/20/20 12:53 AM, Yasumasa Suenaga wrote:
Hi Chris,
On 2020/06/20 15:20, Chris Plummer wrote:
Hi Yasumasa,
ptrace is not used for core files, so the EFAULT for a bad
core file is not a possibility. However, get_lwp_regs()
does redirect to core_get_lwp_regs() for core files. It can
fail, but the only reason it ever does is if the LWP can't
be found in the core (which is never suppose to happen). I
would think if this happened due to the core being
truncated, SA would be blowing up all over the place with
exceptions, probably before we ever get to this code, but
in any cast what we do here wouldn't really make a difference.
You are right, sorry.
I'm not sure why you prefer an exception for errors other
than ESRCH. Why should they be treated differently?
getThreadIntegerRegisterSet0() is used for finding the
current frame for stack tracing. With my changes any
failure will result in deferring to "last java frame" if
set, and otherwise just not produce a stack trace (and the
WARNING will be present in the output). This seems
preferable to completely abandoning any further thread
stack tracking.
I'm not sure we can trust call stack when ptrace() returns
any errors other than ESRCH even if "last java frame" is
available. For example, don't ptrace() return EFAULT or EIO
when something wrong? (e.g. stack corruption) If so, it may
lead to a wrong analysis for troubleshooter.
I think it should be abort dumping call stack for its thread
at least.
Hi Yasumasa,
In general stack walking makes a best effort and can be
wrong, even when not getting errors like this. For any
actively executing thread SA needs to determine where the
stack starts, with register contents being the starting point
(SP, FP, and PC). These registers could contain anything, and
SA makes a best effort to determine a current frame from
them. However, the verification steps it takes are not 100%
guaranteed, and can lead to an incorrect assumption of the
current frame, which in turn can result in an exception later
on when walking the stack. See JDK-8247641.
Keep in mind that the WARNING message will always be there.
This should be enough to put the troubleshooter on alert that
the stack trace may not be accurate. I think it's better to
make an attempt at a stack trace then to just abandon it and
not attempt to do something that may be useful.
thanks,
Chris
Thanks,
Yasumasa
thanks,
Chris
On 6/19/20 6:33 PM, Yasumasa Suenaga wrote:
Hi Chris,
I checked Linux kernel code at a glance, ESRCH seems to be
set to errno by default.
So I guess it is similar to "generic" error code.
https://github.com/torvalds/linux/blob/master/kernel/ptrace.c
According to manpage of ptrace(2), it might return errno
other than ESRCH.
For example, if we analyze broken core (e.g. the core was
dumped with disk full), we might get EFAULT.
Thus I prefer to handle ESRCH only in your patch, and also
I think SA should throw DebuggerException if other error
is occurred.
https://www.man7.org/linux/man-pages/man2/ptrace.2.html
Thanks,
Yasumasa
On 2020/06/20 5:51, Chris Plummer wrote:
Hello,
I've updated with webrev based on the new finding that a
JavaThread cannot be on the ThreadList after its OS
thread has been destroyed since the JavaThread removes
itself from the ThreadList, and therefore must be running
on its OS thread. The logic of the fix is unchanged from
the first webrev, but I updated the comments to better
reflect what is going on. I also updated the CR:
https://bugs.openjdk.java.net/browse/JDK-8247533
http://cr.openjdk.java.net/~cjplummer/8247533/webrev.01/index.html
thanks,
Chris
On 6/19/20 12:24 AM, David Holmes wrote:
Hi Chris,
On 19/06/2020 8:55 am, Chris Plummer wrote:
On 6/18/20 1:43 AM, David Holmes wrote:
On 18/06/2020 4:49 pm, Chris Plummer wrote:
On 6/17/20 10:29 PM, David Holmes wrote:
On 18/06/2020 3:13 pm, Chris Plummer wrote:
On 6/17/20 10:09 PM, David Holmes wrote:
On 18/06/2020 2:33 pm, Chris Plummer wrote:
On 6/17/20 7:43 PM, David Holmes wrote:
Hi Chris,
On 18/06/2020 6:34 am, Chris Plummer wrote:
Hello,
Please help review the following:
https://bugs.openjdk.java.net/browse/JDK-8247533
http://cr.openjdk.java.net/~cjplummer/8247533/webrev.00/index.html
The CR contains all the needed details. Here's
a summary of changes in each file:
The problem sounds to me like a variation of the
more general problem of not ensuring a thread is
kept alive whilst acting upon it. I don't know
how the SA finds these references to the threads
it is going to stackwalk, but is it possible to
fix this via appropriate uses of
ThreadsListHandle/Iterator?
It fetches ThreadsSMRSupport::_java_thread_list.
Keep in mind that once SA attaches, nothing in
the VM changes. For example, SA can't create a
wrapper to a JavaThread, only to have the
JavaThread be freed later on. It's just not
possible.
Then how does it obtain a reference to a
JavaThread for which the native OS thread id is
invalid? Any thread found in _java_thread_list is
either live or still to be started. In the latter
case the JavaThread->osThread does not have its
thread_id set yet.
My assumption was that the JavaThread is in the
process of being destroyed, and it has freed its OS
thread but is itself still in the thread list. I
did notice that the OS thread id being used looked
to be in the range of thread id #'s you would
expect for the running app, so that to me indicated
it was once valid, but is no more.
Keep in mind that although hotspot may have
synchronization code that prevents you from pulling
a JavaThread off the thread list when it is in the
process of being destroyed (I'm guessing it does),
SA has no such protections.
But you stated that once the SA has attached, the
target VM can't change. If the SA gets its set of
thread from one attach then tries to make queries
about those threads in a separate attach, then
obviously it could be providing garbage thread
information. So you would need to re-validate the
JavaThread in the target VM before trying to do
anything with it.
That's not what is going on here. It's attaching and
doing a stack trace, which involves getting the
thread list and iterating through all threads without
detaching.
Okay so I restate my original comment - all the
JavaThreads must be alive or not yet started, so how
are you encountering an invalid thread id? Any thread
you find via the ThreadsList can't have destroyed its
osThread. In any case the logic should be checking
thread->osThread() for NULL, and then
osThread()->get_state() to ensure it is >= INITIALIZED
before using the thread_id().
Hi David,
I chatted with Dan about this, and he said since the
JavaThread is responsible for removing itself from the
ThreadList, it is impossible to have a JavaThread still
on the ThreadList, but without and underlying OS
Thread. So I'm a bit perplexed as to how I can find a
JavaThread on the ThreadList, but that results in ESRCH
when trying to access the thread with ptrace. My only
conclusion is that this failure is somehow spurious,
and maybe the issue it just that the thread is in some
temporary state that prevents its access. If so, I
still think the approach I'm taking is the correct one,
but the comments should be updated.
ESRCH can have other meanings but I don't know enough
about the broader context to know whether they are
applicable in this case.
ESRCH The specified process does not exist, or is
not currently being traced by the caller, or is not stopped
(for requests that require a stopped tracee).
I won't comment further on the fix/workaround as I don't
know the code. I'll leave that to other folk.
Cheers,
David
-----
I had one other finding. When this issue first turned
up, it prevented the thread from getting a stack trace
due to the exception being thrown. What I hadn't
realize is that after fixing it to not throw an
exception, which resulted in the stack walking code
getting all nulls for register values, I actually
started to see a stack trace printed:
"JLine terminal non blocking reader thread" #26 daemon
prio=5 tid=0x00007f12f0cd6420 nid=0x1f99 runnable
[0x00007f125f0f4000]
java.lang.Thread.State: RUNNABLE
JavaThread state: _thread_in_native
WARNING: getThreadIntegerRegisterSet0: get_lwp_regs
failed for lwp (8089)
CurrentFrameGuess: choosing last Java frame: sp =
0x00007f125f0f4770, fp = 0x00007f125f0f47c0
- java.io.FileInputStream.read0() @bci=0 (Interpreted
frame)
- java.io.FileInputStream.read() @bci=1, line=223
(Interpreted frame)
-
jdk.internal.org.jline.utils.NonBlockingInputStreamImpl.run()
@bci=108, line=216 (Interpreted frame)
-
jdk.internal.org.jline.utils.NonBlockingInputStreamImpl$$Lambda$536+0x0000000800daeca0.run()
@bci=4 (Interpreted frame)
- java.lang.Thread.run() @bci=11, line=832
(Interpreted frame)
The "CurrentFrameGuess" output is some debug tracing I
had enabled, and it indicates that the stack walking
code is using the "last java frame" setting, which it
will do if current registers values don't indicate a
valid frame (as would be the case if sp was null). I
had previously assumed that without an underling valid
LWP, there would be no stack trace. Given that there is
one, there must be a valid LWP. Otherwise I don't see
how the stack could have been walked. That's another
indication that the ptrace failure is spurious in nature.
thanks,
Chris
Cheers,
David
-----
Also, even if you are using something like clhsdb to
issue commands on addresses, if the address is no
longer valid for the command you are executing, then
you would get the appropriate error when there is an
attempt to create a wrapper for it. I don't know of
any command that operates directly on a JavaThread,
but I think there are for InstanceKlass. So if you
remembered the address of an InstanceKlass, and then
reattached and tried a command that takes an
InstanceKlass address, you would get an exception
when SA tries to create the wrapper for the
InsanceKlass if it were no longer a valid address for
one.
Chris
David
-----
Chris
David
-----
Chris
Cheers,
David
src/jdk.hotspot.agent/linux/native/libsaproc/LinuxDebuggerLocal.cpp
src/jdk.hotspot.agent/macosx/native/libsaproc/MacosxDebuggerLocal.m
src/jdk.hotspot.agent/windows/native/libsaproc/sawindbg.cpp
-Instead of throwing an exception when the OS
ThreadID is invalid, print a warning.
src/jdk.hotspot.agent/linux/native/libsaproc/ps_proc.c
-Improve a print_debug message
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/bsd/BsdThread.java
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/linux/LinuxThread.java
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/windbg/amd64/WindbgAMD64Thread.java
-Deal with the array of registers read in being
null due to the OS ThreadID not being valid.
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/bsd/BsdDebuggerLocal.java
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/linux/LinuxDebuggerLocal.java
-Fix issue with
"sun.jvm.hotspot.debugger.DebuggerException"
appearing twice when printing the exception.
thanks,
Chris