Hi David, Thank you for the concerns! Testing showed several tests failing with deadlocks. Scenarios are similar to that you describe.
Trying to understand the details. Thanks, Serguei On 11/4/14 4:09 PM, David Holmes wrote:
Hi Serguei, On 3/11/2014 5:07 PM, serguei.spit...@oracle.com wrote:On 11/2/14 8:58 PM, David Holmes wrote:On 1/11/2014 8:13 PM, Dmitry Samersoff wrote:Serguei, Thank you for good finding. This approach looks much better for me. The fix looks good. Is it necessary to release vmDeathLock locks at eventHandler.c:1244 before call EXIT_ERROR(error,"Can't clear event callbacks on vm death"); ?I agree this looks necessary, or at least more clean (if things are failing we really don't know what is happening).Agreed (replied to Dmitry).More generally I'm concerned about whether any of the code paths taken while holding the new lock can result in deadlock - in particular with regard to the resumeLock ?The cbVMDeath() function never holds both vmDeathLock and resumeLock at the same time, so there is no chance for a deadlock that involves both these locks. Two more locks used in the cbVMDeath() are the callbackBlock and callbackLock. These two locks look completely unrelated to the debugLoop_run(). The debugLoop_run() function also uses the cmdQueueLock. The debugLoop_run() never holds both vmDeathLock and cmdQueueLock at the same time. So that I do not see any potential to introduce new deadlock with the vmDeathLock. However, it is still easy to overlook something here. Please, let me know if you see any danger.I was mainly concerned about what might happen in the call chain for threadControl_resumeAll() (it certainly sounds like it might need to use a resumeLock :) ). I see direct use of the threadLock and indirectly the eventHandler lock; but there are further call paths I did not explore. Wish there was an easy way to determine the transitive closure of all locks used from a given call.Thanks, DavidThanks, SergueiDavid-Dmitry On 2014-11-01 00:07, serguei.spit...@oracle.com wrote:It is 3-rd round of review for: https://bugs.openjdk.java.net/browse/JDK-6988950 New webrev:http://cr.openjdk.java.net/~sspitsyn/webrevs/2014/jdk/6988950-JDWP-wrong-phase.3/Summary For failing scenario, please, refer to the 1-st round RFR below.I've found what is missed in the jdwp agent shutdown and decided toswitch from a workaround to a real fix.The agent VM_DEATH callback sets the gdata field: gdata->vmDead = 1.The agent debugLoop_run() has a guard against the VM shutdown: 165 } else if (gdata->vmDead && 166 ((cmd->cmdSet) != JDWP_COMMAND_SET(VirtualMachine))) { 167 /* Protect the VM from calls while dead. 168 * VirtualMachine cmdSet quietly ignores some cmds169 * after VM death, so, it sends it's own errors.170 */ 171 outStream_setError(&out, JDWP_ERROR(VM_DEAD)); However, the guard above does not help much if the VM_DEATH event happens in the middle of a command execution. There is a lack of synchronization here. The fix introduces new lock (vmDeathLock) which does not allow to execute the commands and the VM_DEATH event callback concurrently. It should work well for any function that is used in implementation of the JDWP_COMMAND_SET(VirtualMachine) . Testing:Run nsk.jdi.testlist, nsk.jdwp.testlist and JTREG com/sun/jdi testsThanks, Serguei On 10/29/14 6:05 PM, serguei.spit...@oracle.com wrote:The updated webrev:http://cr.openjdk.java.net/~sspitsyn/webrevs/2014/jdk/6988950-JDWP-wrong-phase.2/The changes are: - added a comment recommended by Staffan - removed the ignore_wrong_phase() call from function classSignature() The classSignature() function is called in 16 places. Most of them do not tolerate the NULL in place of returned signature and will crash. I'm not comfortable to fix all the occurrences now and suggest to return to thisissue after gaining experience with more failure cases that are stillexpected.The failure with the classSignature() involved was observed only oncein the nightly and should be extremely rare reproducible. I'll file a placeholder bug if necessary. Thanks, Serguei On 10/28/14 6:11 PM, serguei.spit...@oracle.com wrote:Please, review the fix for: https://bugs.openjdk.java.net/browse/JDK-6988950 Open webrev:http://cr.openjdk.java.net/~sspitsyn/webrevs/2014/jdk/6988950-JDWP-wrong-phase.1/Summary: The failing scenario:The debugger and the debuggee are well aware a VM shutdown hasbeen started in the target process.The debugger at this point is not expected to send any commandsto the JDWP agent. However, the JDI layer (debugger side) and the jdwp agent (debuggee side) are not in sync with the consumer layers.One reason is because the test debugger does not invoke the JDImethod VirtualMachine.dispose().Another reason is that the Debugger and the debuggee processesare uneasy to sync in general. As a result the following steps are possible: - The test debugger sends a 'quit' command to the test debuggee - The debuggee is normally exiting - The jdwp backend reports (over the jdwp protocol) an anonymous class unload event - The JDI InternalEventHandler thread handles the ClassUnloadEvent event - The InternalEventHandler wants to uncache the matching reference type. If there is more than one class with the same host class signature, it can't distinguish them,and so, deletes all references and re-retrieves them again(see tracing below): MY_TRACE: JDI: VirtualMachineImpl.retrieveClassesBySignature: sig=Ljava/lang/invoke/LambdaForm$DMH;- The jdwp backend debugLoop_run() gets the command from JDIand calls the functions classesForSignature() and classStatus() recursively. - The classStatus() makes a call to the JVMTI GetClassStatus() and gets the JVMTI_ERROR_WRONG_PHASE- As a result the jdwp backend reports the JVMTI error to theJDI, and so, the test failsFor details, see the analysis in bug report closed as a dup ofthe bug 6988950: https://bugs.openjdk.java.net/browse/JDK-8024865Some similar cases can be found in the two bug reports (6988950and 8024865) describing this issue.The fix is to skip reporting the JVMTI_ERROR_WRONG_PHASE erroras it is normal at the VM shutdown.The original jdwp backend implementation had a similar approachfor the raw monitor functions. Threy use the ignore_vm_death() to workaround the JVMTI_ERROR_WRONG_PHASE errors. For reference, please, see the file: src/share/back/util.c Testing: Run nsk.jdi.testlist, nsk.jdwp.testlist and JTREG com/sun/jdi tests Thanks, Serguei