On 6/20/20 12:53 AM, Yasumasa Suenaga wrote:
Hi Chris,
On 2020/06/20 15:20, Chris Plummer wrote:
Hi Yasumasa,
ptrace is not used for core files, so the EFAULT for a bad core file
is not a possibility. However, get_lwp_regs() does redirect to
core_get_lwp_regs() for core files. It can fail, but the only reason
it ever does is if the LWP can't be found in the core (which is never
suppose to happen). I would think if this happened due to the core
being truncated, SA would be blowing up all over the place with
exceptions, probably before we ever get to this code, but in any cast
what we do here wouldn't really make a difference.
You are right, sorry.
I'm not sure why you prefer an exception for errors other than ESRCH.
Why should they be treated differently?
getThreadIntegerRegisterSet0() is used for finding the current frame
for stack tracing. With my changes any failure will result in
deferring to "last java frame" if set, and otherwise just not produce
a stack trace (and the WARNING will be present in the output). This
seems preferable to completely abandoning any further thread stack
tracking.
I'm not sure we can trust call stack when ptrace() returns any errors
other than ESRCH even if "last java frame" is available. For example,
don't ptrace() return EFAULT or EIO when something wrong? (e.g. stack
corruption) If so, it may lead to a wrong analysis for troubleshooter.
I think it should be abort dumping call stack for its thread at least.
Hi Yasumasa,
In general stack walking makes a best effort and can be wrong, even when
not getting errors like this. For any actively executing thread SA needs
to determine where the stack starts, with register contents being the
starting point (SP, FP, and PC). These registers could contain anything,
and SA makes a best effort to determine a current frame from them.
However, the verification steps it takes are not 100% guaranteed, and
can lead to an incorrect assumption of the current frame, which in turn
can result in an exception later on when walking the stack. See JDK-8247641.
Keep in mind that the WARNING message will always be there. This should
be enough to put the troubleshooter on alert that the stack trace may
not be accurate. I think it's better to make an attempt at a stack trace
then to just abandon it and not attempt to do something that may be useful.
thanks,
Chris
Thanks,
Yasumasa
thanks,
Chris
On 6/19/20 6:33 PM, Yasumasa Suenaga wrote:
Hi Chris,
I checked Linux kernel code at a glance, ESRCH seems to be set to
errno by default.
So I guess it is similar to "generic" error code.
https://github.com/torvalds/linux/blob/master/kernel/ptrace.c
According to manpage of ptrace(2), it might return errno other than
ESRCH.
For example, if we analyze broken core (e.g. the core was dumped
with disk full), we might get EFAULT.
Thus I prefer to handle ESRCH only in your patch, and also I think
SA should throw DebuggerException if other error is occurred.
https://www.man7.org/linux/man-pages/man2/ptrace.2.html
Thanks,
Yasumasa
On 2020/06/20 5:51, Chris Plummer wrote:
Hello,
I've updated with webrev based on the new finding that a
JavaThread cannot be on the ThreadList after its OS thread has been
destroyed since the JavaThread removes itself from the ThreadList,
and therefore must be running on its OS thread. The logic of the
fix is unchanged from the first webrev, but I updated the comments
to better reflect what is going on. I also updated the CR:
https://bugs.openjdk.java.net/browse/JDK-8247533
http://cr.openjdk.java.net/~cjplummer/8247533/webrev.01/index.html
thanks,
Chris
On 6/19/20 12:24 AM, David Holmes wrote:
Hi Chris,
On 19/06/2020 8:55 am, Chris Plummer wrote:
On 6/18/20 1:43 AM, David Holmes wrote:
On 18/06/2020 4:49 pm, Chris Plummer wrote:
On 6/17/20 10:29 PM, David Holmes wrote:
On 18/06/2020 3:13 pm, Chris Plummer wrote:
On 6/17/20 10:09 PM, David Holmes wrote:
On 18/06/2020 2:33 pm, Chris Plummer wrote:
On 6/17/20 7:43 PM, David Holmes wrote:
Hi Chris,
On 18/06/2020 6:34 am, Chris Plummer wrote:
Hello,
Please help review the following:
https://bugs.openjdk.java.net/browse/JDK-8247533
http://cr.openjdk.java.net/~cjplummer/8247533/webrev.00/index.html
The CR contains all the needed details. Here's a summary
of changes in each file:
The problem sounds to me like a variation of the more
general problem of not ensuring a thread is kept alive
whilst acting upon it. I don't know how the SA finds these
references to the threads it is going to stackwalk, but is
it possible to fix this via appropriate uses of
ThreadsListHandle/Iterator?
It fetches ThreadsSMRSupport::_java_thread_list.
Keep in mind that once SA attaches, nothing in the VM
changes. For example, SA can't create a wrapper to a
JavaThread, only to have the JavaThread be freed later on.
It's just not possible.
Then how does it obtain a reference to a JavaThread for
which the native OS thread id is invalid? Any thread found
in _java_thread_list is either live or still to be started.
In the latter case the JavaThread->osThread does not have
its thread_id set yet.
My assumption was that the JavaThread is in the process of
being destroyed, and it has freed its OS thread but is itself
still in the thread list. I did notice that the OS thread id
being used looked to be in the range of thread id #'s you
would expect for the running app, so that to me indicated it
was once valid, but is no more.
Keep in mind that although hotspot may have synchronization
code that prevents you from pulling a JavaThread off the
thread list when it is in the process of being destroyed (I'm
guessing it does), SA has no such protections.
But you stated that once the SA has attached, the target VM
can't change. If the SA gets its set of thread from one attach
then tries to make queries about those threads in a separate
attach, then obviously it could be providing garbage thread
information. So you would need to re-validate the JavaThread
in the target VM before trying to do anything with it.
That's not what is going on here. It's attaching and doing a
stack trace, which involves getting the thread list and
iterating through all threads without detaching.
Okay so I restate my original comment - all the JavaThreads must
be alive or not yet started, so how are you encountering an
invalid thread id? Any thread you find via the ThreadsList can't
have destroyed its osThread. In any case the logic should be
checking thread->osThread() for NULL, and then
osThread()->get_state() to ensure it is >= INITIALIZED before
using the thread_id().
Hi David,
I chatted with Dan about this, and he said since the JavaThread
is responsible for removing itself from the ThreadList, it is
impossible to have a JavaThread still on the ThreadList, but
without and underlying OS Thread. So I'm a bit perplexed as to
how I can find a JavaThread on the ThreadList, but that results
in ESRCH when trying to access the thread with ptrace. My only
conclusion is that this failure is somehow spurious, and maybe
the issue it just that the thread is in some temporary state that
prevents its access. If so, I still think the approach I'm taking
is the correct one, but the comments should be updated.
ESRCH can have other meanings but I don't know enough about the
broader context to know whether they are applicable in this case.
ESRCH The specified process does not exist, or is not
currently being traced by the caller, or is not stopped
(for requests that require a stopped tracee).
I won't comment further on the fix/workaround as I don't know the
code. I'll leave that to other folk.
Cheers,
David
-----
I had one other finding. When this issue first turned up, it
prevented the thread from getting a stack trace due to the
exception being thrown. What I hadn't realize is that after
fixing it to not throw an exception, which resulted in the stack
walking code getting all nulls for register values, I actually
started to see a stack trace printed:
"JLine terminal non blocking reader thread" #26 daemon prio=5
tid=0x00007f12f0cd6420 nid=0x1f99 runnable [0x00007f125f0f4000]
java.lang.Thread.State: RUNNABLE
JavaThread state: _thread_in_native
WARNING: getThreadIntegerRegisterSet0: get_lwp_regs failed for
lwp (8089)
CurrentFrameGuess: choosing last Java frame: sp =
0x00007f125f0f4770, fp = 0x00007f125f0f47c0
- java.io.FileInputStream.read0() @bci=0 (Interpreted frame)
- java.io.FileInputStream.read() @bci=1, line=223 (Interpreted
frame)
- jdk.internal.org.jline.utils.NonBlockingInputStreamImpl.run()
@bci=108, line=216 (Interpreted frame)
-
jdk.internal.org.jline.utils.NonBlockingInputStreamImpl$$Lambda$536+0x0000000800daeca0.run()
@bci=4 (Interpreted frame)
- java.lang.Thread.run() @bci=11, line=832 (Interpreted frame)
The "CurrentFrameGuess" output is some debug tracing I had
enabled, and it indicates that the stack walking code is using
the "last java frame" setting, which it will do if current
registers values don't indicate a valid frame (as would be the
case if sp was null). I had previously assumed that without an
underling valid LWP, there would be no stack trace. Given that
there is one, there must be a valid LWP. Otherwise I don't see
how the stack could have been walked. That's another indication
that the ptrace failure is spurious in nature.
thanks,
Chris
Cheers,
David
-----
Also, even if you are using something like clhsdb to issue
commands on addresses, if the address is no longer valid for
the command you are executing, then you would get the
appropriate error when there is an attempt to create a wrapper
for it. I don't know of any command that operates directly on a
JavaThread, but I think there are for InstanceKlass. So if you
remembered the address of an InstanceKlass, and then reattached
and tried a command that takes an InstanceKlass address, you
would get an exception when SA tries to create the wrapper for
the InsanceKlass if it were no longer a valid address for one.
Chris
David
-----
Chris
David
-----
Chris
Cheers,
David
src/jdk.hotspot.agent/linux/native/libsaproc/LinuxDebuggerLocal.cpp
src/jdk.hotspot.agent/macosx/native/libsaproc/MacosxDebuggerLocal.m
src/jdk.hotspot.agent/windows/native/libsaproc/sawindbg.cpp
-Instead of throwing an exception when the OS ThreadID is
invalid, print a warning.
src/jdk.hotspot.agent/linux/native/libsaproc/ps_proc.c
-Improve a print_debug message
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/bsd/BsdThread.java
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/linux/LinuxThread.java
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/windbg/amd64/WindbgAMD64Thread.java
-Deal with the array of registers read in being null due
to the OS ThreadID not being valid.
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/bsd/BsdDebuggerLocal.java
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/linux/LinuxDebuggerLocal.java
-Fix issue with
"sun.jvm.hotspot.debugger.DebuggerException" appearing
twice when printing the exception.
thanks,
Chris