Hi Yasumasa ,
I think LinuxAMD64CFrame is used for pstack and what I've been looking
at has been jstack, and in particular AMD64CurrentFrameGuess, which does
use "last java frame". I'm not sure why LinuxAMD64CFrame does not look
at "last java frame". Maybe it should.
thanks,
Chris
On 6/23/20 11:04 PM, Yasumasa Suenaga wrote:
Hi Chris,
Thanks you for explanation.
Your change looks good (but "last java frame" would not be found in
Linux AMD64 because RSP is NULL - cf. LinuxAMD64CFrame.java)
Thanks,
Yasumasa
On 2020/06/24 12:09, Chris Plummer wrote:
On 6/23/20 6:05 PM, Yasumasa Suenaga wrote:
Hi Chris,
Skillful troubleshooters who use jhsdb will aware this warnings, and
they will take other appropriate methods.
However, I'm not sure it is worth to continue to perform even if SA
cannot get register values.
For example, Linux AMD64 depends on RIP and RSP values to find top
frame.
According to your change, The caller of
getThreadIntegerRegisterSet() has responsible for dealing with the
set of null registers. However X86ThreadContext::data (it includes
raw register values) would still be zero when it happens.
This is what I intended to have happen. Just end up with a register
set of all nulls. Then when stack walking code gets a null, it will
revert to "last java frame" if available, otherwise no stack dump is
done.
So I think register holder (e.g. X86ThreadContext) should have
tri-state (have registers, fail to get registers, not yet attempt to
get registers).
OTOH it might be over-engineering. What do you think?
Before implementing this I looked at the what would be the easier
approach to get the desired effect of stack walking code simply
failing over to using "last java frame", and decided the null set of
registers was easiest. Other approaches involved more changes and
impacted more files.
thanks,
Chris
Thanks,
Yasumasa
On 2020/06/24 3:16, Chris Plummer wrote:
On 6/20/20 12:53 AM, Yasumasa Suenaga wrote:
Hi Chris,
On 2020/06/20 15:20, Chris Plummer wrote:
Hi Yasumasa,
ptrace is not used for core files, so the EFAULT for a bad core
file is not a possibility. However, get_lwp_regs() does redirect
to core_get_lwp_regs() for core files. It can fail, but the only
reason it ever does is if the LWP can't be found in the core
(which is never suppose to happen). I would think if this
happened due to the core being truncated, SA would be blowing up
all over the place with exceptions, probably before we ever get
to this code, but in any cast what we do here wouldn't really
make a difference.
You are right, sorry.
I'm not sure why you prefer an exception for errors other than
ESRCH. Why should they be treated differently?
getThreadIntegerRegisterSet0() is used for finding the current
frame for stack tracing. With my changes any failure will result
in deferring to "last java frame" if set, and otherwise just not
produce a stack trace (and the WARNING will be present in the
output). This seems preferable to completely abandoning any
further thread stack tracking.
I'm not sure we can trust call stack when ptrace() returns any
errors other than ESRCH even if "last java frame" is available.
For example, don't ptrace() return EFAULT or EIO when something
wrong? (e.g. stack corruption) If so, it may lead to a wrong
analysis for troubleshooter.
I think it should be abort dumping call stack for its thread at
least.
Hi Yasumasa,
In general stack walking makes a best effort and can be wrong, even
when not getting errors like this. For any actively executing
thread SA needs to determine where the stack starts, with register
contents being the starting point (SP, FP, and PC). These registers
could contain anything, and SA makes a best effort to determine a
current frame from them. However, the verification steps it takes
are not 100% guaranteed, and can lead to an incorrect assumption of
the current frame, which in turn can result in an exception later
on when walking the stack. See JDK-8247641.
Keep in mind that the WARNING message will always be there. This
should be enough to put the troubleshooter on alert that the stack
trace may not be accurate. I think it's better to make an attempt
at a stack trace then to just abandon it and not attempt to do
something that may be useful.
thanks,
Chris
Thanks,
Yasumasa
thanks,
Chris
On 6/19/20 6:33 PM, Yasumasa Suenaga wrote:
Hi Chris,
I checked Linux kernel code at a glance, ESRCH seems to be set
to errno by default.
So I guess it is similar to "generic" error code.
https://github.com/torvalds/linux/blob/master/kernel/ptrace.c
According to manpage of ptrace(2), it might return errno other
than ESRCH.
For example, if we analyze broken core (e.g. the core was dumped
with disk full), we might get EFAULT.
Thus I prefer to handle ESRCH only in your patch, and also I
think SA should throw DebuggerException if other error is occurred.
https://www.man7.org/linux/man-pages/man2/ptrace.2.html
Thanks,
Yasumasa
On 2020/06/20 5:51, Chris Plummer wrote:
Hello,
I've updated with webrev based on the new finding that a
JavaThread cannot be on the ThreadList after its OS thread has
been destroyed since the JavaThread removes itself from the
ThreadList, and therefore must be running on its OS thread. The
logic of the fix is unchanged from the first webrev, but I
updated the comments to better reflect what is going on. I also
updated the CR:
https://bugs.openjdk.java.net/browse/JDK-8247533
http://cr.openjdk.java.net/~cjplummer/8247533/webrev.01/index.html
thanks,
Chris
On 6/19/20 12:24 AM, David Holmes wrote:
Hi Chris,
On 19/06/2020 8:55 am, Chris Plummer wrote:
On 6/18/20 1:43 AM, David Holmes wrote:
On 18/06/2020 4:49 pm, Chris Plummer wrote:
On 6/17/20 10:29 PM, David Holmes wrote:
On 18/06/2020 3:13 pm, Chris Plummer wrote:
On 6/17/20 10:09 PM, David Holmes wrote:
On 18/06/2020 2:33 pm, Chris Plummer wrote:
On 6/17/20 7:43 PM, David Holmes wrote:
Hi Chris,
On 18/06/2020 6:34 am, Chris Plummer wrote:
Hello,
Please help review the following:
https://bugs.openjdk.java.net/browse/JDK-8247533
http://cr.openjdk.java.net/~cjplummer/8247533/webrev.00/index.html
The CR contains all the needed details. Here's a
summary of changes in each file:
The problem sounds to me like a variation of the more
general problem of not ensuring a thread is kept alive
whilst acting upon it. I don't know how the SA finds
these references to the threads it is going to
stackwalk, but is it possible to fix this via
appropriate uses of ThreadsListHandle/Iterator?
It fetches ThreadsSMRSupport::_java_thread_list.
Keep in mind that once SA attaches, nothing in the VM
changes. For example, SA can't create a wrapper to a
JavaThread, only to have the JavaThread be freed later
on. It's just not possible.
Then how does it obtain a reference to a JavaThread for
which the native OS thread id is invalid? Any thread
found in _java_thread_list is either live or still to be
started. In the latter case the JavaThread->osThread
does not have its thread_id set yet.
My assumption was that the JavaThread is in the process
of being destroyed, and it has freed its OS thread but is
itself still in the thread list. I did notice that the OS
thread id being used looked to be in the range of thread
id #'s you would expect for the running app, so that to
me indicated it was once valid, but is no more.
Keep in mind that although hotspot may have
synchronization code that prevents you from pulling a
JavaThread off the thread list when it is in the process
of being destroyed (I'm guessing it does), SA has no such
protections.
But you stated that once the SA has attached, the target
VM can't change. If the SA gets its set of thread from one
attach then tries to make queries about those threads in a
separate attach, then obviously it could be providing
garbage thread information. So you would need to
re-validate the JavaThread in the target VM before trying
to do anything with it.
That's not what is going on here. It's attaching and doing
a stack trace, which involves getting the thread list and
iterating through all threads without detaching.
Okay so I restate my original comment - all the JavaThreads
must be alive or not yet started, so how are you
encountering an invalid thread id? Any thread you find via
the ThreadsList can't have destroyed its osThread. In any
case the logic should be checking thread->osThread() for
NULL, and then osThread()->get_state() to ensure it is >=
INITIALIZED before using the thread_id().
Hi David,
I chatted with Dan about this, and he said since the
JavaThread is responsible for removing itself from the
ThreadList, it is impossible to have a JavaThread still on
the ThreadList, but without and underlying OS Thread. So I'm
a bit perplexed as to how I can find a JavaThread on the
ThreadList, but that results in ESRCH when trying to access
the thread with ptrace. My only conclusion is that this
failure is somehow spurious, and maybe the issue it just that
the thread is in some temporary state that prevents its
access. If so, I still think the approach I'm taking is the
correct one, but the comments should be updated.
ESRCH can have other meanings but I don't know enough about
the broader context to know whether they are applicable in
this case.
ESRCH The specified process does not exist, or is not
currently being traced by the caller, or is not stopped
(for requests that require a stopped tracee).
I won't comment further on the fix/workaround as I don't know
the code. I'll leave that to other folk.
Cheers,
David
-----
I had one other finding. When this issue first turned up, it
prevented the thread from getting a stack trace due to the
exception being thrown. What I hadn't realize is that after
fixing it to not throw an exception, which resulted in the
stack walking code getting all nulls for register values, I
actually started to see a stack trace printed:
"JLine terminal non blocking reader thread" #26 daemon prio=5
tid=0x00007f12f0cd6420 nid=0x1f99 runnable [0x00007f125f0f4000]
java.lang.Thread.State: RUNNABLE
JavaThread state: _thread_in_native
WARNING: getThreadIntegerRegisterSet0: get_lwp_regs failed
for lwp (8089)
CurrentFrameGuess: choosing last Java frame: sp =
0x00007f125f0f4770, fp = 0x00007f125f0f47c0
- java.io.FileInputStream.read0() @bci=0 (Interpreted frame)
- java.io.FileInputStream.read() @bci=1, line=223
(Interpreted frame)
-
jdk.internal.org.jline.utils.NonBlockingInputStreamImpl.run()
@bci=108, line=216 (Interpreted frame)
-
jdk.internal.org.jline.utils.NonBlockingInputStreamImpl$$Lambda$536+0x0000000800daeca0.run()
@bci=4 (Interpreted frame)
- java.lang.Thread.run() @bci=11, line=832 (Interpreted frame)
The "CurrentFrameGuess" output is some debug tracing I had
enabled, and it indicates that the stack walking code is
using the "last java frame" setting, which it will do if
current registers values don't indicate a valid frame (as
would be the case if sp was null). I had previously assumed
that without an underling valid LWP, there would be no stack
trace. Given that there is one, there must be a valid LWP.
Otherwise I don't see how the stack could have been walked.
That's another indication that the ptrace failure is spurious
in nature.
thanks,
Chris
Cheers,
David
-----
Also, even if you are using something like clhsdb to issue
commands on addresses, if the address is no longer valid
for the command you are executing, then you would get the
appropriate error when there is an attempt to create a
wrapper for it. I don't know of any command that operates
directly on a JavaThread, but I think there are for
InstanceKlass. So if you remembered the address of an
InstanceKlass, and then reattached and tried a command that
takes an InstanceKlass address, you would get an exception
when SA tries to create the wrapper for the InsanceKlass if
it were no longer a valid address for one.
Chris
David
-----
Chris
David
-----
Chris
Cheers,
David
src/jdk.hotspot.agent/linux/native/libsaproc/LinuxDebuggerLocal.cpp
src/jdk.hotspot.agent/macosx/native/libsaproc/MacosxDebuggerLocal.m
src/jdk.hotspot.agent/windows/native/libsaproc/sawindbg.cpp
-Instead of throwing an exception when the OS
ThreadID is invalid, print a warning.
src/jdk.hotspot.agent/linux/native/libsaproc/ps_proc.c
-Improve a print_debug message
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/bsd/BsdThread.java
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/linux/LinuxThread.java
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/windbg/amd64/WindbgAMD64Thread.java
-Deal with the array of registers read in being null
due to the OS ThreadID not being valid.
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/bsd/BsdDebuggerLocal.java
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/linux/LinuxDebuggerLocal.java
-Fix issue with
"sun.jvm.hotspot.debugger.DebuggerException"
appearing twice when printing the exception.
thanks,
Chris