From the traces, the cleanup of a killed user space process (in this case opensm) is hung in the kernel. IBAL is waiting forever on a reference count to drop to 0. From the details that were provided, either a large number of MADs have been leaked or there's a race condition somewhere that prevents AHs from being freed during the course of normal operation.
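
To make the failure mode concrete, here is a minimal user-space sketch of the reference counting involved. It is not IBAL code: obj_t, obj_init, obj_deref, obj_try_destroy and pd_refs_after_kill are hypothetical names, and the assumption that an in-flight MAD is what pins the AH is only one of the two possibilities above. Each child pins its parent, so a single reference that is never dropped on an AH keeps the PD's count above zero and the destroy path can never complete.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an IBAL object; NOT the real al_obj_t layout. */
typedef struct obj {
    struct obj *parent;   /* e.g. the PD an AH was created on */
    int         ref_cnt;  /* references still outstanding */
    bool        destroying;
} obj_t;

static void obj_init(obj_t *o, obj_t *parent)
{
    o->parent = parent;
    o->ref_cnt = 1;            /* the creator's reference */
    o->destroying = false;
    if (parent != NULL)
        parent->ref_cnt++;     /* a child pins its parent */
}

static void obj_deref(obj_t *o)
{
    assert(o->ref_cnt > 0);
    if (--o->ref_cnt == 0 && o->parent != NULL)
        obj_deref(o->parent);  /* last reference gone: unpin the parent */
}

/* Drops the creator's reference once. Returns true if destruction can
 * finish now, false where a kernel implementation would keep waiting. */
static bool obj_try_destroy(obj_t *o)
{
    if (!o->destroying) {
        o->destroying = true;
        obj_deref(o);
    }
    return o->ref_cnt == 0;
}

/* Replays the reported sequence and returns the PD's ref_cnt after the
 * process is killed. ASSUMPTION: an in-flight MAD holds the extra AH
 * reference; mad_returned says whether that reference was ever dropped. */
static int pd_refs_after_kill(bool mad_returned)
{
    obj_t pd, ah;

    obj_init(&pd, NULL);
    obj_init(&ah, &pd);
    ah.ref_cnt++;              /* MAD in flight pins the AH (assumed) */

    if (mad_returned)
        obj_deref(&ah);        /* normal completion path */

    obj_try_destroy(&ah);      /* cleanup of the killed process */
    obj_try_destroy(&pd);
    return pd.ref_cnt;
}
```

In this model pd_refs_after_kill(false) leaves the PD with one reference and the AH in the destroying state with ref_cnt 1, which is the shape seen in the dump below; pd_refs_after_kill(true) lets both counts reach 0.
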

> -----Original Message-----
> From: Smith, Stan
> Sent: Thursday, February 02, 2012 10:55 AM
> To: Leonid Keller; Hefty, Sean; Tzachi Dar
> Cc: Uri Habusha; ofw_list; Irena Gannon
> Subject: RE: opensm stuck upon kill
>
> Leo,
> What are you saying exactly by 'opensm stuck on kill'? More kill info please.
>
> Was OpenSM running as a service and via service control you said stop?
> OpenSM running as a console application '--console local' and you typed the 'exit' command?
> OpenSM running and you just killed the process?
>
> Killed how?
>
> Thanks,
>
> Stan.
>
> >-----Original Message-----
> >From: Leonid Keller [mailto:[email protected]]
> >Sent: Thursday, February 02, 2012 6:42 AM
> >To: Leonid Keller; Hefty, Sean; Tzachi Dar; Smith, Stan
> >Cc: Uri Habusha; ofw_list; Irena Gannon
> >Subject: opensm stuck upon kill
> >
> >Hi guys,
> >
> >opensm got stuck upon kill
> >I'll try to keep the full dump and will send you if you are interested.
> >
> >The stuck happens in IBAL upon releasing PD.
> >
> > nt!DbgBreakPoint
> > ibbus!sync_destroy_obj+0xa61
> > ibbus!destroy_obj+0x8ad
> > ibbus!async_destroy_obj+0xa4
> > ibbus!ib_dealloc_pd+0x2b6
> > winmad!WmRegRemoveHandler+0xae
> >...
> >
> >PD can't be released because its children AVs are not released:
> >
> >// from ibbus!sync_destroy_obj
> >1: kd> ?? p_obj
> >struct _al_obj * 0xa970fbbc
> > ...
> > +0x080 ref_cnt : 1
> > ...
> > +0x0a4 type : 3 //it's AV
> > +0x0a8 state : 3 ( CL_DESTROYING )
> > ...
> >
> >There are 227 children (AVs), which - as far as I understand, are created and attached to PD upon send_mad.
> >There were several applications, that were running at the time of stuck, opensm was one of them.
> >Opensm was killed and has now only one thread, the one which is stuck:
> >
> > [cda39020 opensm.exe]
> > 83c.0003a8 9af686f0 0000002 RUNNING nt!DbgBreakPoint
> >     ibbus!sync_destroy_obj+0xa61
> >     ibbus!destroy_obj+0x8ad
> >     ibbus!async_destroy_obj+0xa4
> >     ibbus!ib_dealloc_pd+0x2b6
> >     winmad!WmRegRemoveHandler+0xae
> >     winmad!WmRegFree+0xe
> >     winmad!WmProviderCleanup+0x24
> >     winmad!WmFileCleanup+0x3a
> >     Wdf01000!FxFileObjectFileCleanup::Invoke+0x24
> >     Wdf01000!FxPkgGeneral::OnCleanup+0x57
> >     Wdf01000!FxPkgGeneral::Dispatch+0xcb
> >     Wdf01000!FxDevice::Dispatch+0x7f
> >     nt!IovCallDriver+0x23f
> >     nt!IofCallDriver+0x1b
> >     nt!IopCloseFile+0x387
> >     nt!ObpDecrementHandleCount+0x146
> >     nt!ObpCloseHandleTableEntry+0x234
> >     nt!ExSweepHandleTable+0x5f
> >     nt!ObKillProcess+0x54
> >     nt!PspExitThread+0x5b6
> >     nt!PsExitSpecialApc+0x22
> >     nt!KiDeliverApc+0x1dc
> >     nt!KiServiceExit+0x56
> >     ntdll!KiFastSystemCallRet
> >     ntdll!ZwWaitForWorkViaWorkerFactory+0xc
> >     ntdll!TppWorkerThread+0x1f6
> >     kernel32!BaseThreadInitThunk+0xe
> >     ntdll!__RtlUserThreadStart+0x23
> >     ntdll!_RtlUserThreadStart+
> >
> >winmad!WmRegRemoveHandler+0xae is standing here:
> >
> > WmProviderDeregister(pRegistration->pProvider, pRegistration);
> > pRegistration->pDevice->IbInterface.destroy_qp(pRegistration->hQp, NULL);
> > pRegistration->pDevice->IbInterface.dealloc_pd(pRegistration->hPd, NULL);
> >> pRegistration->pDevice->IbInterface.close_ca(pRegistration->hCa, NULL);
> >
> >Could you suggest some idea ?
> >Thank you.
> >
> >
> >-----Original Message-----
> >From: Leonid Keller
> >Sent: Tuesday, January 31, 2012 1:15 PM
> >To: 'Hefty, Sean'; Tzachi Dar; Smith, Stan
> >Cc: Uri Habusha; ofw_list; Irena Gannon
> >Subject: RE: Opensm & WinMad: a race, cauing BSOD722
> >
> >Thank you, Sean.
> >
> >Some comments.
> >We do not think that this additional validation is necessary.
> >It's hard to believe - unless you saw that - that Windows can call close(handle) after open(&handle) has failed.
> >
> >As to the patch to winverbs - it causes a crash, because WvProviderGet is called at DISPATCH level.
> >
> >ATTEMPTED_SWITCH_FROM_DPC (b8)
> >A wait operation, attach process, or yield was attempted from a DPC routine.
> >This is an illegal operation and the stack track will lead to the offending
> >code and original DPC routine.
> >
> >nt!KiSwapContext+0x7f
> >nt!KiSwapThread+0x2fa
> >nt!KeWaitForGate+0x22a
> >nt!KiAcquireGuardedMutex+0x35
> >nt!KeAcquireGuardedMutex+0x39
> >winverbs!WvProviderGet+0x1d
> >winverbs!WvEpCompleteDisconnect+0x113
> >winverbs!WvEpIbCmHandler+0x26a
> >ibbus!cm_cep_handler+0x99
> >ibbus!__process_cep+0x10f
> >ibbus!__drep_handler+0x6ea
> >ibbus!__cep_mad_recv_cb+0x246
> >ibbus!__mad_svc_recv_done+0xb58
> >ibbus!mad_disp_recv_done+0x1650
> >ibbus!process_mad_recv+0x3bf
> >ibbus!spl_qp_comp+0x3d2
> >ibbus!spl_qp_recv_dpc_cb+0x112
> >nt!KiRetireDpcList+0x117
> >nt!KyRetireDpcList+0x5
> >nt!KiDispatchInterruptContinue
> >
> >I've replaced mutex by spinlock - see below.
> >I did it also for WinMad, albeit it has no asynchronous callbacks like WinVerbs.
> >The main reason is to keep it similar to WinVerbs as it is today.
> >A minor, mostly theoretical one: there are other functions, which are using today the provider mutex. It seems for me worthful to keep for them possibility to call a low-level WvProviderGet function.
> >What's your opinion ?
> >
> >Index: B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.c
> >===================================================================
> >--- B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.c (revision 9686)
> >+++ B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.c (revision 9687)
> >@@ -44,14 +44,15 @@
> > LONG WvProviderGet(WV_PROVIDER *pProvider)
> > {
> >     LONG val;
> >+    KIRQL irql;
> >
> >-    KeAcquireGuardedMutex(&pProvider->Lock);
> >+    KeAcquireSpinLock(&pProvider->SpinLock, &irql);
> >     val = InterlockedIncrement(&pProvider->Ref);
> >     if (val == 1) {
> >         pProvider->Ref = 0;
> >         val = 0;
> >     }
> >-    KeReleaseGuardedMutex(&pProvider->Lock);
> >+    KeReleaseSpinLock(&pProvider->SpinLock, irql);
> >     return val;
> > }
> >
> >@@ -119,6 +120,7 @@
> >     KeInitializeEvent(&pProvider->SharedEvent, NotificationEvent, FALSE);
> >     pProvider->Exclusive = 0;
> >     KeInitializeEvent(&pProvider->ExclusiveEvent, SynchronizationEvent, FALSE);
> >+    KeInitializeSpinLock(&pProvider->SpinLock);
> >     return STATUS_SUCCESS;
> > }
> >
> >Index: B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.h
> >===================================================================
> >--- B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.h (revision 9686)
> >+++ B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.h (revision 9687)
> >@@ -80,6 +80,7 @@
> >     KEVENT ExclusiveEvent;
> >
> >     WORK_QUEUE WorkQueue;
> >+    KSPIN_LOCK SpinLock;
> >
> > } WV_PROVIDER;
> >
> >Index: B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.h
> >===================================================================
> >--- B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.h (revision 9687)
> >+++ B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.h (revision 9688)
> >@@ -57,6 +57,7 @@
> >     KEVENT SharedEvent;
> >     LONG Exclusive;
> >     KEVENT ExclusiveEvent;
> >+    KSPIN_LOCK SpinLock;
> >
> > } WM_PROVIDER;
> >
> >Index: B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.c
> >===================================================================
> >--- B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.c (revision 9687)
> >+++ B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.c (revision 9688)
> >@@ -36,14 +36,15 @@
> > LONG WmProviderGet(WM_PROVIDER *pProvider)
> > {
> >     LONG val;
> >+    KIRQL irql;
> >
> >-    KeAcquireGuardedMutex(&pProvider->Lock);
> >+    KeAcquireSpinLock(&pProvider->SpinLock, &irql);
> >     val = InterlockedIncrement(&pProvider->Ref);
> >     if (val == 1) {
> >         pProvider->Ref = 0;
> >         val = 0;
> >     }
> >-    KeReleaseGuardedMutex(&pProvider->Lock);
> >+    KeReleaseSpinLock(&pProvider->SpinLock, irql);
> >     return val;
> > }
> >
> >@@ -72,6 +73,7 @@
> >     KeInitializeEvent(&pProvider->SharedEvent, NotificationEvent, FALSE);
> >     pProvider->Exclusive = 0;
> >     KeInitializeEvent(&pProvider->ExclusiveEvent, SynchronizationEvent, FALSE);
> >+    KeInitializeSpinLock(&pProvider->SpinLock);
> >
> >     ASSERT(ControlDevice != NULL);
> >
> >
> >-----Original Message-----
> >From: Hefty, Sean [mailto:[email protected]]
> >Sent: Tuesday, January 31, 2012 12:08 AM
> >To: Leonid Keller; Tzachi Dar; Smith, Stan
> >Cc: Uri Habusha; ofw_list; Irena Gannon
> >Subject: RE: Opensm & WinMad: a race, cauing BSOD722
> >
> >> Two ideas:
> >> WmProviderInit() is called without checking the return status. Is there a reason ?
> >> Seems like the similar patch is needed for WvIoDeviceControl().
> >
> >I can't tell whether IOCTLs suffer from the same problem or not. But since Windows is stupid, I went ahead and added the same protection to winverbs, plus some additional validation in case we get a cleanup event for a file for which we failed to create.
> >
> >- Sean
_______________________________________________
ofw mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw
