Re: RFR(S): 8247533: SA stack walking sometimes fails with sun.jvm.hotspot.debugger.DebuggerException: get_thread_regs failed for a lwp

Chris Plummer Wed, 24 Jun 2020 21:10:24 -0700

On 6/24/20 6:53 PM, Yasumasa Suenaga wrote:

On 2020/06/25 10:00, Chris Plummer wrote:
On 6/24/20 5:17 PM, Yasumasa Suenaga wrote:
On 2020/06/25 3:22, Chris Plummer wrote:
On 6/24/20 12:01 AM, Yasumasa Suenaga wrote:
On 2020/06/24 15:32, Chris Plummer wrote:
Hi Yasumasa,
I think LinuxAMD64CFrame is used for pstack and what I've been looking at has been jstack, and in particular AMD64CurrentFrameGuess, which does use "last java frame". I'm not sure why LinuxAMD64CFrame does not look at "last java frame". Maybe it should.
I was considering both patterns (jstack and mixed stack) for this change.
As you know, mixed jstack (jstack --mixed) attempts to find the top of the native stack via LinuxAMD64CFrame, and register values are needed for that (so it depends on the ptrace() call). So I guess mixed-mode jstack (jhsdb jstack --mixed) would not show any stacks (it cannot find "last java frame").
Hi Yasumasa,
I should have been more clear on what I meant by jstack and pstack. For jstack I meant using StackTrace.java, which is what you get by default with "jhsdb jstack" and also the clhsdb jstack command. For pstack I meant PStack.java, which is what you get with "jhsdb jstack --mixed" or the clhsdb pstack command.
So this CR impacts both types of stack traces in that they will get null registers when the lower level API fails to get the register set. For StackTrace.java it will then defer to "last java frame" if available. For PStack.java it will not, and will always result in no stack trace. The code of interest is here:
        AMD64ThreadContext context = (AMD64ThreadContext) thread.getContext();
        Address pc = context.getRegisterAsAddress(AMD64ThreadContext.RIP);
        if (pc == null) return null;
        return LinuxAMD64CFrame.getTopFrame(dbg, pc, context);
So the question is should "last java frame" be used if pc == null. If so, then getTopFrame() would also need to be modified to use "last java frame" when fetching RBP.
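To make the two behaviors concrete, here is a minimal sketch of how each path reacts to a null register set. The names Regs, TopFrameSketch, javaTopFrame and mixedTopFrame are hypothetical stand-ins, not SA classes, and the real frame-guessing code performs considerably more validation than this.

```java
// Hypothetical stand-ins for illustration only; these are not SA types.
final class Regs {
    final Long pc; // null when the register set could not be read
    final Long sp; // null when the register set could not be read

    Regs(Long pc, Long sp) {
        this.pc = pc;
        this.sp = sp;
    }
}

final class TopFrameSketch {
    // StackTrace-style path: fall back to the saved "last java frame" sp
    // when the registers are missing.
    static String javaTopFrame(Regs regs, Long lastJavaSp) {
        if (regs.sp != null) {
            return "frame from registers, sp=0x" + Long.toHexString(regs.sp);
        }
        if (lastJavaSp != null) {
            return "frame from last java frame, sp=0x" + Long.toHexString(lastJavaSp);
        }
        return null; // nothing to start from: no stack trace for this thread
    }

    // PStack-style path: PC-driven, so a null pc means no top frame at all.
    static String mixedTopFrame(Regs regs) {
        if (regs.pc == null) {
            return null;
        }
        return "native frame at pc=0x" + Long.toHexString(regs.pc);
    }
}
```

With every register null, javaTopFrame can still produce a starting frame when a last-Java-frame sp was saved, whereas mixedTopFrame returns null and no mixed stack is printed.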
I don't think so, because CFrame is defined as "Models a "C" programming language frame on the stack" in the javadoc, so it should have *valid* register values IMHO. In addition, RIP is needed for Linux AMD64 at least, because it uses DWARF as of JDK-8234624.
Hi Yasumasa,
I don't quite understand the "C" frame nomenclature since CFrame is used for non C frames also. The PStack code roughly does the following:
CFrame f = cdbg.topFrameForThread();
ClosestSymbol sym = f.closestSymbolToPC();
Address pc = f.pc();
if (sym != null) {
    ... native symbol
} else if (interp.contains(pc)) {
    ... print interpreter frame
So if the CFrame was filled in with "last java frame" values, it should allow PStack to print the stack starting with the "last java frame". Any native frame below that point would be missed.
To use "last java frame" in this case looks good because stack unwinding is a best effort behavior. However, PStack::run is PC-driven, and I want to respect that. In other words, it should not run if we cannot get register values, even if "last java frame" is available.

Ok, that sounds reasonable.

thanks,

Chris

Thanks,

Yasumasa
Chris
Thanks,

Yasumasa
thanks,

Chris
Thanks,

Yasumasa
thanks,

Chris

On 6/23/20 11:04 PM, Yasumasa Suenaga wrote:
Hi Chris,

Thank you for the explanation.
Your change looks good (but "last java frame" would not be found in Linux AMD64 because RSP is NULL - cf. LinuxAMD64CFrame.java)
Thanks,

Yasumasa


On 2020/06/24 12:09, Chris Plummer wrote:
On 6/23/20 6:05 PM, Yasumasa Suenaga wrote:
Hi Chris,
Skillful troubleshooters who use jhsdb will be aware of these warnings, and they will use other appropriate methods.
However, I'm not sure it is worth continuing even if SA cannot get register values.
For example, Linux AMD64 depends on RIP and RSP values to find the top frame. According to your change, the caller of getThreadIntegerRegisterSet() is responsible for dealing with the set of null registers. However, X86ThreadContext::data (it includes raw register values) would still be zero when it happens.
This is what I intended to have happen. Just end up with a register set of all nulls. Then when stack walking code gets a null, it will revert to "last java frame" if available, otherwise no stack dump is done.
So I think the register holder (e.g. X86ThreadContext) should be tri-state (have registers, failed to get registers, not yet attempted to get registers).
OTOH it might be over-engineering. What do you think?
Before implementing this I looked at what would be the easiest approach to get the desired effect of stack walking code simply failing over to using "last java frame", and decided the null set of registers was easiest. Other approaches involved more changes and impacted more files.
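For comparison, the tri-state register holder floated above might look roughly like the following. This is only a sketch of the idea: RegisterState and RegisterHolder are hypothetical names, and nothing in the thread indicates that X86ThreadContext was changed this way.

```java
// Hypothetical sketch of a tri-state register holder; not part of SA.
enum RegisterState { NOT_ATTEMPTED, AVAILABLE, FAILED }

final class RegisterHolder {
    private RegisterState state = RegisterState.NOT_ATTEMPTED;
    private long[] data = new long[0];

    // Record the outcome of a register read; a null array models a failed read.
    void set(long[] regs) {
        if (regs == null) {
            state = RegisterState.FAILED;
            data = new long[0];
        } else {
            state = RegisterState.AVAILABLE;
            data = regs.clone();
        }
    }

    RegisterState state() {
        return state;
    }

    // Only hand out a value when the registers were actually read, so a
    // failed read can never be mistaken for a register that contains zero.
    long get(int index) {
        if (state != RegisterState.AVAILABLE) {
            throw new IllegalStateException("registers " + state);
        }
        return data[index];
    }
}
```

The extra state distinguishes "never asked" from "asked and failed", at the cost of touching every caller that reads registers, which is the trade-off weighed in the reply above.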
thanks,

Chris
Thanks,

Yasumasa


On 2020/06/24 3:16, Chris Plummer wrote:
On 6/20/20 12:53 AM, Yasumasa Suenaga wrote:
Hi Chris,

On 2020/06/20 15:20, Chris Plummer wrote:
Hi Yasumasa,
ptrace is not used for core files, so the EFAULT for a bad core file is not a possibility. However, get_lwp_regs() does redirect to core_get_lwp_regs() for core files. It can fail, but the only reason it ever does is if the LWP can't be found in the core (which is never supposed to happen). I would think if this happened due to the core being truncated, SA would be blowing up all over the place with exceptions, probably before we ever get to this code, but in any case what we do here wouldn't really make a difference.
You are right, sorry.
I'm not sure why you prefer an exception for errors other than ESRCH. Why should they be treated differently? getThreadIntegerRegisterSet0() is used for finding the current frame for stack tracing. With my changes any failure will result in deferring to "last java frame" if set, and otherwise just not produce a stack trace (and the WARNING will be present in the output). This seems preferable to completely abandoning any further thread stack tracking.
I'm not sure we can trust the call stack when ptrace() returns any error other than ESRCH, even if "last java frame" is available. For example, doesn't ptrace() return EFAULT or EIO when something is wrong (e.g. stack corruption)? If so, it may lead the troubleshooter to a wrong analysis. I think it should at least abort dumping the call stack for that thread.
Hi Yasumasa,
In general stack walking makes a best effort and can be wrong, even when not getting errors like this. For any actively executing thread SA needs to determine where the stack starts, with register contents being the starting point (SP, FP, and PC). These registers could contain anything, and SA makes a best effort to determine a current frame from them. However, the verification steps it takes are not 100% guaranteed, and can lead to an incorrect assumption of the current frame, which in turn can result in an exception later on when walking the stack. See JDK-8247641.
Keep in mind that the WARNING message will always be there. This should be enough to put the troubleshooter on alert that the stack trace may not be accurate. I think it's better to make an attempt at a stack trace than to just abandon it and not attempt to do something that may be useful.
thanks,

Chris
Thanks,

Yasumasa
thanks,

Chris

On 6/19/20 6:33 PM, Yasumasa Suenaga wrote:
Hi Chris,
I checked the Linux kernel code at a glance, and ESRCH seems to be set to errno by default.
So I guess it is similar to a "generic" error code.

https://github.com/torvalds/linux/blob/master/kernel/ptrace.c
According to the manpage of ptrace(2), it might return an errno other than ESRCH. For example, if we analyze a broken core (e.g. the core was dumped with the disk full), we might get EFAULT. Thus I prefer to handle only ESRCH in your patch, and I also think SA should throw DebuggerException if any other error occurs.
https://www.man7.org/linux/man-pages/man2/ptrace.2.html
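The policy proposed here (warn on ESRCH, throw on anything else) can be sketched as follows. This is an illustration of the suggestion, not of the webrev, which by Chris's description treats every failure the same way. PtraceErrorPolicy and warnAndContinue are hypothetical names, IllegalStateException stands in for SA's DebuggerException, and the errno constants are the Linux values.

```java
// Hypothetical sketch of the "handle ESRCH only" proposal; not SA code.
final class PtraceErrorPolicy {
    // Linux errno values.
    static final int ESRCH = 3;
    static final int EIO = 5;
    static final int EFAULT = 14;

    // Returns true when the caller should print a warning and continue with a
    // null register set; throws for any other errno.
    static boolean warnAndContinue(int errno) {
        if (errno == ESRCH) {
            return true;
        }
        // Stand-in for throwing DebuggerException in SA.
        throw new IllegalStateException("get_lwp_regs failed, errno=" + errno);
    }
}
```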


Thanks,

Yasumasa


On 2020/06/20 5:51, Chris Plummer wrote:
Hello,
I've updated the webrev based on the new finding that a JavaThread cannot be on the ThreadList after its OS thread has been destroyed, since the JavaThread removes itself from the ThreadList and therefore must be running on its OS thread. The logic of the fix is unchanged from the first webrev, but I updated the comments to better reflect what is going on. I also updated the CR:
https://bugs.openjdk.java.net/browse/JDK-8247533
http://cr.openjdk.java.net/~cjplummer/8247533/webrev.01/index.html
thanks,

Chris

On 6/19/20 12:24 AM, David Holmes wrote:
Hi Chris,

On 19/06/2020 8:55 am, Chris Plummer wrote:
On 6/18/20 1:43 AM, David Holmes wrote:
On 18/06/2020 4:49 pm, Chris Plummer wrote:
On 6/17/20 10:29 PM, David Holmes wrote:
On 18/06/2020 3:13 pm, Chris Plummer wrote:
On 6/17/20 10:09 PM, David Holmes wrote:
On 18/06/2020 2:33 pm, Chris Plummer wrote:
On 6/17/20 7:43 PM, David Holmes wrote:
Hi Chris,

On 18/06/2020 6:34 am, Chris Plummer wrote:
Hello,

Please help review the following:

https://bugs.openjdk.java.net/browse/JDK-8247533
http://cr.openjdk.java.net/~cjplummer/8247533/webrev.00/index.html
The CR contains all the needed details. Here's a summary of changes in each file:
The problem sounds to me like a variation of the more general problem of not ensuring a thread is kept alive whilst acting upon it. I don't know how the SA finds these references to the threads it is going to stackwalk, but is it possible to fix this via appropriate uses of ThreadsListHandle/Iterator?
It fetches ThreadsSMRSupport::_java_thread_list.
Keep in mind that once SA attaches, nothing in the VM changes. For example, SA can't create a wrapper to a JavaThread, only to have the JavaThread be freed later on. It's just not possible.
Then how does it obtain a reference to a JavaThread for which the native OS thread id is invalid? Any thread found in _java_thread_list is either live or still to be started. In the latter case the JavaThread->osThread does not have its thread_id set yet.
My assumption was that the JavaThread is in the process of being destroyed, and it has freed its OS thread but is itself still in the thread list. I did notice that the OS thread id being used looked to be in the range of thread id #'s you would expect for the running app, so that to me indicated it was once valid, but is no more.
Keep in mind that although hotspot may have synchronization code that prevents you from pulling a JavaThread off the thread list when it is in the process of being destroyed (I'm guessing it does), SA has no such protections.
But you stated that once the SA has attached, the target VM can't change. If the SA gets its set of threads from one attach then tries to make queries about those threads in a separate attach, then obviously it could be providing garbage thread information. So you would need to re-validate the JavaThread in the target VM before trying to do anything with it.
That's not what is going on here. It's attaching and doing a stack trace, which involves getting the thread list and iterating through all threads without detaching.
Okay so I restate my original comment - all the JavaThreads must be alive or not yet started, so how are you encountering an invalid thread id? Any thread you find via the ThreadsList can't have destroyed its osThread. In any case the logic should be checking thread->osThread() for NULL, and then osThread()->get_state() to ensure it is >= INITIALIZED before using the thread_id().
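The check described in the last sentence could be sketched like this. OsThreadState is a reduced, hypothetical subset of states and OsThreadSketch and ThreadIdCheck are illustrative names, so this shows only the shape of the guard rather than the SA or HotSpot types.

```java
// Hypothetical stand-ins; the state list is deliberately simplified.
enum OsThreadState { ALLOCATED, INITIALIZED, RUNNABLE, ZOMBIE }

final class OsThreadSketch {
    final OsThreadState state;
    final int threadId;

    OsThreadSketch(OsThreadState state, int threadId) {
        this.state = state;
        this.threadId = threadId;
    }
}

final class ThreadIdCheck {
    // Returns the thread id only if the OS thread exists and its state is at
    // least INITIALIZED; otherwise returns -1 to signal "do not use".
    static int usableThreadId(OsThreadSketch os) {
        if (os == null) {
            return -1; // no OS thread attached to this JavaThread
        }
        if (os.state.compareTo(OsThreadState.INITIALIZED) < 0) {
            return -1; // thread_id has not been set yet
        }
        return os.threadId;
    }
}
```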
Hi David,
I chatted with Dan about this, and he said since the JavaThread is responsible for removing itself from the ThreadList, it is impossible to have a JavaThread still on the ThreadList, but without an underlying OS thread. So I'm a bit perplexed as to how I can find a JavaThread on the ThreadList, but that results in ESRCH when trying to access the thread with ptrace. My only conclusion is that this failure is somehow spurious, and maybe the issue is just that the thread is in some temporary state that prevents its access. If so, I still think the approach I'm taking is the correct one, but the comments should be updated.
ESRCH can have other meanings but I don't know enough about the broader context to know whether they are applicable in this case.
ESRCH  The specified process does not exist, or is not currently
       being traced by the caller, or is not stopped (for requests
       that require a stopped tracee).
I won't comment further on the fix/workaround as I don't know the code. I'll leave that to other folk.
Cheers,
David
-----
I had one other finding. When this issue first turned up, it prevented the thread from getting a stack trace due to the exception being thrown. What I hadn't realized is that after fixing it to not throw an exception, which resulted in the stack walking code getting all nulls for register values, I actually started to see a stack trace printed:
"JLine terminal non blocking reader thread" #26 daemon prio=5 tid=0x00007f12f0cd6420 nid=0x1f99 runnable [0x00007f125f0f4000]
    java.lang.Thread.State: RUNNABLE
    JavaThread state: _thread_in_native
WARNING: getThreadIntegerRegisterSet0: get_lwp_regs failed for lwp (8089)
CurrentFrameGuess: choosing last Java frame: sp = 0x00007f125f0f4770, fp = 0x00007f125f0f47c0
 - java.io.FileInputStream.read0() @bci=0 (Interpreted frame)
 - java.io.FileInputStream.read() @bci=1, line=223 (Interpreted frame)
 - jdk.internal.org.jline.utils.NonBlockingInputStreamImpl.run() @bci=108, line=216 (Interpreted frame)
 - jdk.internal.org.jline.utils.NonBlockingInputStreamImpl$$Lambda$536+0x0000000800daeca0.run() @bci=4 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=832 (Interpreted frame)
The "CurrentFrameGuess" output is some debug tracing I had enabled, and it indicates that the stack walking code is using the "last java frame" setting, which it will do if current register values don't indicate a valid frame (as would be the case if sp was null). I had previously assumed that without an underlying valid LWP, there would be no stack trace. Given that there is one, there must be a valid LWP. Otherwise I don't see how the stack could have been walked. That's another indication that the ptrace failure is spurious in nature.
thanks,

Chris
Cheers,
David
-----
Also, even if you are using something like clhsdb to issue commands on addresses, if the address is no longer valid for the command you are executing, then you would get the appropriate error when there is an attempt to create a wrapper for it. I don't know of any command that operates directly on a JavaThread, but I think there are for InstanceKlass. So if you remembered the address of an InstanceKlass, and then reattached and tried a command that takes an InstanceKlass address, you would get an exception when SA tries to create the wrapper for the InstanceKlass if it were no longer a valid address for one.
Chris
David
-----
Chris
David
-----
Chris
Cheers,
David
src/jdk.hotspot.agent/linux/native/libsaproc/LinuxDebuggerLocal.cpp
src/jdk.hotspot.agent/macosx/native/libsaproc/MacosxDebuggerLocal.m
src/jdk.hotspot.agent/windows/native/libsaproc/sawindbg.cpp
-Instead of throwing an exception when the OSThreadID is invalid, print a warning.

src/jdk.hotspot.agent/linux/native/libsaproc/ps_proc.c
-Improve a print_debug message

src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/bsd/BsdThread.java
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/linux/LinuxThread.java
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/windbg/amd64/WindbgAMD64Thread.java
-Deal with the array of registers read in being null due to the OS ThreadID not being valid.

src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/bsd/BsdDebuggerLocal.java
src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/linux/LinuxDebuggerLocal.java
-Fix issue with "sun.jvm.hotspot.debugger.DebuggerException" appearing twice when printing the exception.

thanks,

Chris
