Re: RFR(S): 8247533: SA stack walking sometimes fails with sun.jvm.hotspot.debugger.DebuggerException: get_thread_regs failed for a lwp

Yasumasa Suenaga Tue, 23 Jun 2020 23:06:31 -0700

Hi Chris,

Thanks you for explanation.
Your change looks good (but "last java frame" would not be found in Linux AMD64 
because RSP is NULL - cf. LinuxAMD64CFrame.java)



Thanks,

Yasumasa


On 2020/06/24 12:09, Chris Plummer wrote:

On 6/23/20 6:05 PM, Yasumasa Suenaga wrote:

Hi Chris,

Skillful troubleshooters who use jhsdb will aware this warnings, and they will 
take other appropriate methods.

However, I'm not sure it is worth to continue to perform even if SA cannot get 
register values.

For example, Linux AMD64 depends on RIP and RSP values to find top frame.
According to your change, The caller of getThreadIntegerRegisterSet() has 
responsible for dealing with the set of null registers. However 
X86ThreadContext::data (it includes raw register values) would still be zero 
when it happens.

This is  what I intended to have happen. Just end up with a register set of all nulls. 
Then when stack walking code gets a null, it will revert to "last java frame" 
if available, otherwise no stack dump is done.


So I think register holder (e.g. X86ThreadContext) should have tri-state (have 
registers, fail to get registers, not yet attempt to get registers).
OTOH it might be over-engineering. What do you think?

Before implementing this I looked at the what would be the easier approach to get the 
desired effect of stack walking code simply failing over to using "last java 
frame", and decided the null set of registers was easiest. Other approaches involved 
more changes and impacted more files.

thanks,

Chris



Thanks,

Yasumasa


On 2020/06/24 3:16, Chris Plummer wrote:

On 6/20/20 12:53 AM, Yasumasa Suenaga wrote:

Hi Chris,

On 2020/06/20 15:20, Chris Plummer wrote:

Hi Yasumasa,

ptrace is not used for core files, so the EFAULT for a bad core file is not a 
possibility. However, get_lwp_regs() does redirect to core_get_lwp_regs() for 
core files. It can fail, but the only reason it ever does is if the LWP can't 
be found in the core (which is never suppose to happen). I would think if this 
happened due to the core being truncated, SA would be blowing up all over the 
place with exceptions, probably before we ever get to this code, but in any 
cast what we do here wouldn't really make a difference.


You are right, sorry.

I'm not sure why you prefer an exception for errors other than ESRCH. Why should they be 
treated differently? getThreadIntegerRegisterSet0() is used for finding the current frame 
for stack tracing. With my changes any failure will result in deferring to "last 
java frame" if set, and otherwise just not produce a stack trace (and the WARNING 
will be present in the output). This seems preferable to completely abandoning any 
further thread stack tracking.


I'm not sure we can trust call stack when ptrace() returns any errors other than ESRCH 
even if "last java frame" is available. For example, don't ptrace() return 
EFAULT or EIO when something wrong? (e.g. stack corruption) If so, it may lead to a wrong 
analysis for troubleshooter.
I think it should be abort dumping call stack for its thread at least.

Hi Yasumasa,

In general stack walking makes a best effort and can be wrong, even when not 
getting errors like this. For any actively executing thread SA needs to 
determine where the stack starts, with register contents being the starting 
point (SP, FP, and PC). These registers could contain anything, and SA makes a 
best effort to determine a current frame from them. However, the verification 
steps it takes are not 100% guaranteed, and can lead to an incorrect assumption 
of the current frame, which in turn can result in an exception later on when 
walking the stack. See JDK-8247641.

Keep in mind that the WARNING message will always be there. This should be 
enough to put the troubleshooter on alert that the stack trace may not be 
accurate. I think it's better to make an attempt at a stack trace then to just 
abandon it and not attempt to do something that may be useful.

thanks,

Chris



Thanks,

Yasumasa

thanks,

Chris

On 6/19/20 6:33 PM, Yasumasa Suenaga wrote:

Hi Chris,

I checked Linux kernel code at a glance, ESRCH seems to be set to errno by 
default.
So I guess it is similar to "generic" error code.

https://github.com/torvalds/linux/blob/master/kernel/ptrace.c

According to manpage of ptrace(2), it might return errno other than ESRCH.
For example, if we analyze broken core (e.g. the core was dumped with disk 
full), we might get EFAULT.
Thus I prefer to handle ESRCH only in your patch, and also I think SA should 
throw DebuggerException if other error is occurred.

https://www.man7.org/linux/man-pages/man2/ptrace.2.html


Thanks,

Yasumasa


On 2020/06/20 5:51, Chris Plummer wrote:

Hello,

I've  updated with webrev based on the new finding that a JavaThread cannot be 
on the ThreadList after its OS thread has been destroyed since the JavaThread 
removes itself from the ThreadList, and therefore must be running on its OS 
thread. The logic of the fix is unchanged from the first webrev, but I updated 
the comments to better reflect what is going on. I also updated the CR:

https://bugs.openjdk.java.net/browse/JDK-8247533
http://cr.openjdk.java.net/~cjplummer/8247533/webrev.01/index.html

thanks,

Chris

On 6/19/20 12:24 AM, David Holmes wrote:

Hi Chris,

On 19/06/2020 8:55 am, Chris Plummer wrote:

On 6/18/20 1:43 AM, David Holmes wrote:

On 18/06/2020 4:49 pm, Chris Plummer wrote:

On 6/17/20 10:29 PM, David Holmes wrote:

On 18/06/2020 3:13 pm, Chris Plummer wrote:

On 6/17/20 10:09 PM, David Holmes wrote:

On 18/06/2020 2:33 pm, Chris Plummer wrote:

On 6/17/20 7:43 PM, David Holmes wrote:

Hi Chris,

On 18/06/2020 6:34 am, Chris Plummer wrote:

Hello,

Please help review the following:

https://bugs.openjdk.java.net/browse/JDK-8247533
http://cr.openjdk.java.net/~cjplummer/8247533/webrev.00/index.html

The CR contains all the needed details. Here's a summary of changes in each 
file:


The problem sounds to me like a variation of the more general problem of not 
ensuring a thread is kept alive whilst acting upon it. I don't know how the SA 
finds these references to the threads it is going to stackwalk, but is it 
possible to fix this via appropriate uses of ThreadsListHandle/Iterator?

It fetches ThreadsSMRSupport::_java_thread_list.

Keep in mind that once SA attaches, nothing in the VM changes. For example, SA 
can't create a wrapper to a JavaThread, only to have the JavaThread be freed 
later on. It's just not possible.


Then how does it obtain a reference to a JavaThread for which the native OS thread 
id is invalid? Any thread found in _java_thread_list is either live or still to be 
started. In the latter case the JavaThread->osThread does not have its 
thread_id set yet.

My assumption was that the JavaThread is in the process of being destroyed, and 
it has freed its OS thread but is itself still in the thread list. I did notice 
that the OS thread id being used looked to be in the range of thread id #'s you 
would expect for the running app, so that to me indicated it was once valid, 
but is no more.

Keep in mind that although hotspot may have synchronization code that prevents 
you from pulling a JavaThread off the thread list when it is in the process of 
being destroyed (I'm guessing it does), SA has no such protections.


But you stated that once the SA has attached, the target VM can't change. If 
the SA gets its set of thread from one attach then tries to make queries about 
those threads in a separate attach, then obviously it could be providing 
garbage thread information. So you would need to re-validate the JavaThread in 
the target VM before trying to do anything with it.

That's not what is going on here. It's attaching and doing a stack trace, which involves getting the thread list and iterating through all threads without detaching.


Okay so I restate my original comment - all the JavaThreads must be alive or not yet 
started, so how are you encountering an invalid thread id? Any thread you find via the 
ThreadsList can't have destroyed its osThread. In any case the logic should be checking 
thread->osThread() for NULL, and then osThread()->get_state() to ensure it is 
>= INITIALIZED before using the thread_id().

Hi David,

I chatted with Dan about this, and he said since the JavaThread is responsible 
for removing itself from the ThreadList, it is impossible to have a JavaThread 
still on the ThreadList, but without and underlying OS Thread. So I'm a bit 
perplexed as to how I can find a JavaThread on the ThreadList, but that results 
in ESRCH when trying to access the thread with ptrace. My only conclusion is 
that this failure is somehow spurious, and maybe the issue it just that the 
thread is in some temporary state that prevents its access. If so, I still 
think the approach I'm taking is the correct one, but the comments should be 
updated.


ESRCH can have other meanings but I don't know enough about the broader context 
to know whether they are applicable in this case.

    ESRCH  The  specified  process  does not exist, or is not currently being 
traced by the caller, or is not stopped
              (for requests that require a stopped tracee).

I won't comment further on the fix/workaround as I don't know the code. I'll 
leave that to other folk.

Cheers,
David
-----

I had one other finding. When this issue first turned up, it prevented the 
thread from getting a stack trace due to the exception being thrown. What I 
hadn't realize is that after fixing it to not throw an exception, which 
resulted in the stack walking code getting all nulls for register values, I 
actually started to see a stack trace printed:

"JLine terminal non blocking reader thread" #26 daemon prio=5 
tid=0x00007f12f0cd6420 nid=0x1f99 runnable [0x00007f125f0f4000]
    java.lang.Thread.State: RUNNABLE
    JavaThread state: _thread_in_native
WARNING: getThreadIntegerRegisterSet0: get_lwp_regs failed for lwp (8089)
CurrentFrameGuess: choosing last Java frame: sp = 0x00007f125f0f4770, fp = 
0x00007f125f0f47c0
  - java.io.FileInputStream.read0() @bci=0 (Interpreted frame)
  - java.io.FileInputStream.read() @bci=1, line=223 (Interpreted frame)
  - jdk.internal.org.jline.utils.NonBlockingInputStreamImpl.run() @bci=108, 
line=216 (Interpreted frame)
  - 
jdk.internal.org.jline.utils.NonBlockingInputStreamImpl$$Lambda$536+0x0000000800daeca0.run()
 @bci=4 (Interpreted frame)
  - java.lang.Thread.run() @bci=11, line=832 (Interpreted frame)

The "CurrentFrameGuess" output is some debug tracing I had enabled, and it indicates that 
the stack walking code is using the "last java frame" setting, which it will do if 
current registers values don't indicate a valid frame (as would be the case if sp was null). I had 
previously assumed that without an underling valid LWP, there would be no stack trace. Given that 
there is one, there must be a valid LWP. Otherwise I don't see how the stack could have been 
walked. That's another indication that the ptrace failure is spurious in nature.

thanks,

Chris


Cheers,
David
-----

Also, even if you are using something like clhsdb to issue commands on 
addresses, if the address is no longer valid for the command you are executing, 
then you would get the appropriate error when there is an attempt to create a 
wrapper for it. I don't know of any command that operates directly on a 
JavaThread, but I think there are for InstanceKlass. So if you remembered the 
address of an InstanceKlass, and then reattached and tried a command that takes 
an InstanceKlass address, you would get an exception when SA tries to create 
the wrapper for the InsanceKlass if it were no longer a valid address for one.

Chris


David
-----

Chris

David
-----

Chris


Cheers,
David

src/jdk.hotspot.agent/linux/native/libsaproc/LinuxDebuggerLocal.cpp
src/jdk.hotspot.agent/macosx/native/libsaproc/MacosxDebuggerLocal.m
src/jdk.hotspot.agent/windows/native/libsaproc/sawindbg.cpp
-Instead of throwing an exception when the OS ThreadID is invalid, print a 
warning.

src/jdk.hotspot.agent/linux/native/libsaproc/ps_proc.c
-Improve a print_debug message

src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/bsd/BsdThread.java
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/linux/LinuxThread.java
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/windbg/amd64/WindbgAMD64Thread.java
-Deal with the array of registers read in being null due to the OS ThreadID not 
being valid.

src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/bsd/BsdDebuggerLocal.java
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/linux/LinuxDebuggerLocal.java
-Fix issue with "sun.jvm.hotspot.debugger.DebuggerException" appearing twice 
when printing the exception.

thanks,

Chris

Re: RFR(S): 8247533: SA stack walking sometimes fails with sun.jvm.hotspot.debugger.DebuggerException: get_thread_regs failed for a lwp

Reply via email to