Re: RFR(S): 8247533: SA stack walking sometimes fails with sun.jvm.hotspot.debugger.DebuggerException: get_thread_regs failed for a lwp

Chris Plummer Sat, 20 Jun 2020 00:08:48 -0700

Hi Yasumasa,

ptrace is not used for core files, so the EFAULT for a bad core file is not a possibility. However, get_lwp_regs() does redirect to core_get_lwp_regs() for core files. It can fail, but the only reason it ever does is if the LWP can't be found in the core (which is never supposed to happen). I would think if this happened due to the core being truncated, SA would be blowing up all over the place with exceptions, probably before we ever get to this code, but in any case what we do here wouldn't really make a difference.

I'm not sure why you prefer an exception for errors other than ESRCH. Why should they be treated differently? getThreadIntegerRegisterSet0() is used for finding the current frame for stack tracing. With my changes any failure will result in deferring to "last java frame" if set, and otherwise just not produce a stack trace (and the WARNING will be present in the output). This seems preferable to completely abandoning any further thread stack tracking.
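
The fallback described above can be sketched roughly as follows. This is an illustration only, not the code in the webrev: FrameFallbackSketch, Frame, and chooseStartFrame() are hypothetical names, the register indices are made up, and a null register array stands in for a failed getThreadIntegerRegisterSet0() call.

```java
import java.util.Optional;

public class FrameFallbackSketch {
    /** Hypothetical stand-in for a frame identified by sp and fp. */
    static final class Frame {
        final long sp;
        final long fp;
        final String source; // where the frame came from, for illustration

        Frame(long sp, long fp, String source) {
            this.sp = sp;
            this.fp = fp;
            this.source = source;
        }
    }

    // Hypothetical register indices; the real SA layouts are platform specific.
    static final int SP = 0;
    static final int FP = 1;

    /**
     * Picks the frame to start walking from.
     * regs == null models a failed register read (any error, not only ESRCH).
     * lastJavaFrame == null models a thread with no "last java frame" set.
     */
    static Optional<Frame> chooseStartFrame(long[] regs, Frame lastJavaFrame) {
        if (regs != null && regs[SP] != 0) {
            return Optional.of(new Frame(regs[SP], regs[FP], "registers"));
        }
        if (regs == null) {
            // Warn and carry on instead of throwing, so other threads still get walked.
            System.err.println("WARNING: register read failed, trying last Java frame");
        }
        // Empty result means: print no stack trace for this thread.
        return Optional.ofNullable(lastJavaFrame);
    }

    public static void main(String[] args) {
        Frame last = new Frame(0x00007f125f0f4770L, 0x00007f125f0f47c0L, "lastJavaFrame");
        System.out.println(chooseStartFrame(null, last).map(f -> f.source).orElse("none"));
        System.out.println(chooseStartFrame(null, null).map(f -> f.source).orElse("none"));
    }
}
```

In this sketch every kind of failure takes the same path, which is the behavior argued for above; treating ESRCH separately would mean adding a second failure signal next to the null array.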


thanks,

Chris

On 6/19/20 6:33 PM, Yasumasa Suenaga wrote:

Hi Chris,
I checked the Linux kernel code at a glance; ESRCH seems to be set to errno by default.
So I guess it is similar to a "generic" error code.

https://github.com/torvalds/linux/blob/master/kernel/ptrace.c
According to the manpage of ptrace(2), it might return an errno other than ESRCH. For example, if we analyze a broken core (e.g. the core was dumped with the disk full), we might get EFAULT. Thus I prefer to handle ESRCH only in your patch, and I also think SA should throw DebuggerException if any other error occurs.
https://www.man7.org/linux/man-pages/man2/ptrace.2.html


Thanks,

Yasumasa


On 2020/06/20 5:51, Chris Plummer wrote:
Hello,
I've updated the webrev based on the new finding that a JavaThread cannot be on the ThreadList after its OS thread has been destroyed since the JavaThread removes itself from the ThreadList, and therefore must be running on its OS thread. The logic of the fix is unchanged from the first webrev, but I updated the comments to better reflect what is going on. I also updated the CR:
https://bugs.openjdk.java.net/browse/JDK-8247533
http://cr.openjdk.java.net/~cjplummer/8247533/webrev.01/index.html

thanks,

Chris

On 6/19/20 12:24 AM, David Holmes wrote:
Hi Chris,

On 19/06/2020 8:55 am, Chris Plummer wrote:
On 6/18/20 1:43 AM, David Holmes wrote:
On 18/06/2020 4:49 pm, Chris Plummer wrote:
On 6/17/20 10:29 PM, David Holmes wrote:
On 18/06/2020 3:13 pm, Chris Plummer wrote:
On 6/17/20 10:09 PM, David Holmes wrote:
On 18/06/2020 2:33 pm, Chris Plummer wrote:
On 6/17/20 7:43 PM, David Holmes wrote:
Hi Chris,

On 18/06/2020 6:34 am, Chris Plummer wrote:
Hello,

Please help review the following:

https://bugs.openjdk.java.net/browse/JDK-8247533
http://cr.openjdk.java.net/~cjplummer/8247533/webrev.00/index.html
The CR contains all the needed details. Here's a summary of changes in each file:
The problem sounds to me like a variation of the more general problem of not ensuring a thread is kept alive whilst acting upon it. I don't know how the SA finds these references to the threads it is going to stackwalk, but is it possible to fix this via appropriate uses of ThreadsListHandle/Iterator?
It fetches ThreadsSMRSupport::_java_thread_list.
Keep in mind that once SA attaches, nothing in the VM changes. For example, SA can't create a wrapper to a JavaThread, only to have the JavaThread be freed later on. It's just not possible.
Then how does it obtain a reference to a JavaThread for which the native OS thread id is invalid? Any thread found in _java_thread_list is either live or still to be started. In the latter case the JavaThread->osThread does not have its thread_id set yet.
My assumption was that the JavaThread is in the process of being destroyed, and it has freed its OS thread but is itself still in the thread list. I did notice that the OS thread id being used looked to be in the range of thread id #'s you would expect for the running app, so that to me indicated it was once valid, but is no more.
Keep in mind that although hotspot may have synchronization code that prevents you from pulling a JavaThread off the thread list when it is in the process of being destroyed (I'm guessing it does), SA has no such protections.
But you stated that once the SA has attached, the target VM can't change. If the SA gets its set of threads from one attach then tries to make queries about those threads in a separate attach, then obviously it could be providing garbage thread information. So you would need to re-validate the JavaThread in the target VM before trying to do anything with it.
That's not what is going on here. It's attaching and doing a stack trace, which involves getting the thread list and iterating through all threads without detaching.
Okay so I restate my original comment - all the JavaThreads must be alive or not yet started, so how are you encountering an invalid thread id? Any thread you find via the ThreadsList can't have destroyed its osThread. In any case the logic should be checking thread->osThread() for NULL, and then osThread()->get_state() to ensure it is >= INITIALIZED before using the thread_id().
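
A rough sketch of the check suggested above, written against a toy model rather than the real HotSpot or SA classes. ThreadIdCheckSketch, JavaThreadModel, OsThreadModel, and usableThreadId() are hypothetical names, and the State enum is a simplified assumption about the ordering of OS thread states, not a copy of the VM's definition.

```java
import java.util.OptionalLong;

public class ThreadIdCheckSketch {
    /** Simplified, assumed ordering of OS thread states (not the VM's full list). */
    enum State { ALLOCATED, INITIALIZED, RUNNABLE, ZOMBIE }

    /** Hypothetical model of an OS thread record. */
    static final class OsThreadModel {
        final State state;
        final long threadId;

        OsThreadModel(State state, long threadId) {
            this.state = state;
            this.threadId = threadId;
        }
    }

    /** Hypothetical model of a JavaThread; osThread may be null. */
    static final class JavaThreadModel {
        final OsThreadModel osThread;

        JavaThreadModel(OsThreadModel osThread) {
            this.osThread = osThread;
        }
    }

    /** Returns the thread id only when it is safe to use, otherwise empty. */
    static OptionalLong usableThreadId(JavaThreadModel thread) {
        OsThreadModel os = thread.osThread;
        if (os == null) {
            return OptionalLong.empty(); // no OS thread attached
        }
        if (os.state.compareTo(State.INITIALIZED) < 0) {
            return OptionalLong.empty(); // thread id not set yet
        }
        return OptionalLong.of(os.threadId);
    }

    public static void main(String[] args) {
        JavaThreadModel started = new JavaThreadModel(new OsThreadModel(State.RUNNABLE, 8089L));
        JavaThreadModel unstarted = new JavaThreadModel(new OsThreadModel(State.ALLOCATED, 0L));
        System.out.println(usableThreadId(started).isPresent());
        System.out.println(usableThreadId(unstarted).isPresent());
    }
}
```

A caller would skip the register read whenever the result is empty, rather than passing a possibly unset id to the debugger.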
Hi David,
I chatted with Dan about this, and he said since the JavaThread is responsible for removing itself from the ThreadList, it is impossible to have a JavaThread still on the ThreadList, but without an underlying OS Thread. So I'm a bit perplexed as to how I can find a JavaThread on the ThreadList, but that results in ESRCH when trying to access the thread with ptrace. My only conclusion is that this failure is somehow spurious, and maybe the issue is just that the thread is in some temporary state that prevents its access. If so, I still think the approach I'm taking is the correct one, but the comments should be updated.
ESRCH can have other meanings but I don't know enough about the broader context to know whether they are applicable in this case.
ESRCH  The specified process does not exist, or is not currently being traced by the caller, or is not stopped (for requests that require a stopped tracee).
I won't comment further on the fix/workaround as I don't know the code. I'll leave that to other folk.
Cheers,
David
-----
I had one other finding. When this issue first turned up, it prevented the thread from getting a stack trace due to the exception being thrown. What I hadn't realized is that after fixing it to not throw an exception, which resulted in the stack walking code getting all nulls for register values, I actually started to see a stack trace printed:
"JLine terminal non blocking reader thread" #26 daemon prio=5 tid=0x00007f12f0cd6420 nid=0x1f99 runnable [0x00007f125f0f4000]
    java.lang.Thread.State: RUNNABLE
    JavaThread state: _thread_in_native
WARNING: getThreadIntegerRegisterSet0: get_lwp_regs failed for lwp(8089)
CurrentFrameGuess: choosing last Java frame: sp = 0x00007f125f0f4770, fp = 0x00007f125f0f47c0
  - java.io.FileInputStream.read0() @bci=0 (Interpreted frame)
  - java.io.FileInputStream.read() @bci=1, line=223 (Interpreted frame)
  - jdk.internal.org.jline.utils.NonBlockingInputStreamImpl.run() @bci=108, line=216 (Interpreted frame)
  - jdk.internal.org.jline.utils.NonBlockingInputStreamImpl$$Lambda$536+0x0000000800daeca0.run() @bci=4 (Interpreted frame)
  - java.lang.Thread.run() @bci=11, line=832 (Interpreted frame)
The "CurrentFrameGuess" output is some debug tracing I had enabled, and it indicates that the stack walking code is using the "last java frame" setting, which it will do if the current register values don't indicate a valid frame (as would be the case if sp was null). I had previously assumed that without an underlying valid LWP, there would be no stack trace. Given that there is one, there must be a valid LWP. Otherwise I don't see how the stack could have been walked. That's another indication that the ptrace failure is spurious in nature.
thanks,

Chris
Cheers,
David
-----
Also, even if you are using something like clhsdb to issue commands on addresses, if the address is no longer valid for the command you are executing, then you would get the appropriate error when there is an attempt to create a wrapper for it. I don't know of any command that operates directly on a JavaThread, but I think there are for InstanceKlass. So if you remembered the address of an InstanceKlass, and then reattached and tried a command that takes an InstanceKlass address, you would get an exception when SA tries to create the wrapper for the InstanceKlass if it were no longer a valid address for one.
Chris
David
-----
Chris
David
-----
Chris
Cheers,
David
src/jdk.hotspot.agent/linux/native/libsaproc/LinuxDebuggerLocal.cpp
src/jdk.hotspot.agent/macosx/native/libsaproc/MacosxDebuggerLocal.m
src/jdk.hotspot.agent/windows/native/libsaproc/sawindbg.cpp
-Instead of throwing an exception when the OS ThreadID is invalid, print a warning.
src/jdk.hotspot.agent/linux/native/libsaproc/ps_proc.c
-Improve a print_debug message
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/bsd/BsdThread.java
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/linux/LinuxThread.java
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/windbg/amd64/WindbgAMD64Thread.java
-Deal with the array of registers read in being null due to the OS ThreadID not being valid.
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/bsd/BsdDebuggerLocal.java
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/linux/LinuxDebuggerLocal.java
-Fix issue with "sun.jvm.hotspot.debugger.DebuggerException" appearing twice when printing the exception.
thanks,

Chris
