Leo,
  What exactly do you mean by 'opensm stuck on kill'? Please give more detail on how it was killed.

Was OpenSM running as a service, and did you stop it via service control?
Was OpenSM running as a console application ('--console local'), and did you
type the 'exit' command?
Was OpenSM running, and you just killed the process?

Killed how?

Thanks,

Stan.

>-----Original Message-----
>From: Leonid Keller [mailto:[email protected]]
>Sent: Thursday, February 02, 2012 6:42 AM
>To: Leonid Keller; Hefty, Sean; Tzachi Dar; Smith, Stan
>Cc: Uri Habusha; ofw_list; Irena Gannon
>Subject: opensm stuck upon kill
>
>Hi guys,
>
>opensm hung when it was killed.
>I'll try to keep the full dump and will send it to you if you are interested.
>
>The hang happens in IBAL while releasing the PD.
>
> nt!DbgBreakPoint
> ibbus!sync_destroy_obj+0xa61
> ibbus!destroy_obj+0x8ad
> ibbus!async_destroy_obj+0xa4
> ibbus!ib_dealloc_pd+0x2b6
> winmad!WmRegRemoveHandler+0xae
>...
>
>The PD can't be released because its child AVs have not been released:
>
>// from ibbus!sync_destroy_obj
>1: kd> ?? p_obj
>struct _al_obj * 0xa970fbbc
>   ...
>   +0x080 ref_cnt          : 1
>   ...
>   +0x0a4 type             : 3         //it's AV
>   +0x0a8 state            : 3 ( CL_DESTROYING )
>   ...
>
>There are 227 children (AVs), which, as far as I understand, are created and
>attached to the PD upon send_mad.
>Several applications were running at the time of the hang;
>opensm was one of them.
>Opensm was killed and now has only one thread, the one that is stuck:
>
>                          [cda39020 opensm.exe]
> 83c.0003a8  9af686f0 0000002 RUNNING    nt!DbgBreakPoint
>                                        ibbus!sync_destroy_obj+0xa61
>                                        ibbus!destroy_obj+0x8ad
>                                        ibbus!async_destroy_obj+0xa4
>                                        ibbus!ib_dealloc_pd+0x2b6
>                                        winmad!WmRegRemoveHandler+0xae
>                                        winmad!WmRegFree+0xe
>                                        winmad!WmProviderCleanup+0x24
>                                        winmad!WmFileCleanup+0x3a
>                                        Wdf01000!FxFileObjectFileCleanup::Invoke+0x24
>                                        Wdf01000!FxPkgGeneral::OnCleanup+0x57
>                                        Wdf01000!FxPkgGeneral::Dispatch+0xcb
>                                        Wdf01000!FxDevice::Dispatch+0x7f
>                                        nt!IovCallDriver+0x23f
>                                        nt!IofCallDriver+0x1b
>                                        nt!IopCloseFile+0x387
>                                        nt!ObpDecrementHandleCount+0x146
>                                        nt!ObpCloseHandleTableEntry+0x234
>                                        nt!ExSweepHandleTable+0x5f
>                                        nt!ObKillProcess+0x54
>                                        nt!PspExitThread+0x5b6
>                                        nt!PsExitSpecialApc+0x22
>                                        nt!KiDeliverApc+0x1dc
>                                        nt!KiServiceExit+0x56
>                                        ntdll!KiFastSystemCallRet
>                                        ntdll!ZwWaitForWorkViaWorkerFactory+0xc
>                                        ntdll!TppWorkerThread+0x1f6
>                                        kernel32!BaseThreadInitThunk+0xe
>                                        ntdll!__RtlUserThreadStart+0x23
>                                        ntdll!_RtlUserThreadStart+
>
>winmad!WmRegRemoveHandler+0xae points here:
>
>       WmProviderDeregister(pRegistration->pProvider, pRegistration);
>       pRegistration->pDevice->IbInterface.destroy_qp(pRegistration->hQp, NULL);
>       pRegistration->pDevice->IbInterface.dealloc_pd(pRegistration->hPd, NULL);
>>      pRegistration->pDevice->IbInterface.close_ca(pRegistration->hCa, NULL);
>
>Do you have any ideas?
>Thank you.
>
>
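A minimal user-mode model of the state shown in the dump above (ref_cnt of 1 while the object is in the destroying state). It assumes, and this is an assumption rather than something taken from the IBAL source, that destruction can only finish once the reference count drops to zero. All names (`model_obj_t`, `model_destroy`, and so on) are hypothetical, and the model is single-threaded with no atomics.

```c
#include <stdbool.h>

/* Hypothetical stand-in for an AL object; not the real al_obj layout. */
typedef enum { MODEL_READY, MODEL_DESTROYING, MODEL_DESTROYED } model_state_t;

typedef struct {
    long          ref_cnt;  /* outstanding references */
    model_state_t state;    /* lifecycle state */
} model_obj_t;

/* A new object starts with one reference held by its owner. */
static void model_init(model_obj_t *obj)
{
    obj->ref_cnt = 1;
    obj->state = MODEL_READY;
}

/* Some other user (for example a pending send) takes a reference. */
static void model_ref(model_obj_t *obj)
{
    obj->ref_cnt++;
}

/* Drop a reference; the last one completes a destroy that is in progress. */
static void model_deref(model_obj_t *obj)
{
    if (--obj->ref_cnt == 0 && obj->state == MODEL_DESTROYING)
        obj->state = MODEL_DESTROYED;
}

/* Start destruction and drop the owner's reference.
 * Returns true only if the destroy completed immediately. */
static bool model_destroy(model_obj_t *obj)
{
    obj->state = MODEL_DESTROYING;
    model_deref(obj);
    return obj->state == MODEL_DESTROYED;
}
```

In this model, an object that still has one extra reference when `model_destroy` is called stays in `MODEL_DESTROYING` with `ref_cnt == 1` until that reference is dropped. A synchronous destroy that waits for completion would block for exactly that long.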
>-----Original Message-----
>From: Leonid Keller
>Sent: Tuesday, January 31, 2012 1:15 PM
>To: 'Hefty, Sean'; Tzachi Dar; Smith, Stan
>Cc: Uri Habusha; ofw_list; Irena Gannon
>Subject: RE: Opensm & WinMad: a race, cauing BSOD722
>
>Thank you, Sean.
>
>Some comments.
>We do not think that this additional validation is necessary.
>It's hard to believe - unless you saw that - that Windows can call 
>close(handle) after open(&handle) has failed.
>
>As to the patch to winverbs - it causes a crash, because WvProviderGet is 
>called at DISPATCH level.
>
>ATTEMPTED_SWITCH_FROM_DPC (b8)
>A wait operation, attach process, or yield was attempted from a DPC routine.
>This is an illegal operation and the stack track will lead to the offending
>code and original DPC routine.
>
>nt!KiSwapContext+0x7f
>nt!KiSwapThread+0x2fa
>nt!KeWaitForGate+0x22a
>nt!KiAcquireGuardedMutex+0x35
>nt!KeAcquireGuardedMutex+0x39
>winverbs!WvProviderGet+0x1d
>winverbs!WvEpCompleteDisconnect+0x113
>winverbs!WvEpIbCmHandler+0x26a
>ibbus!cm_cep_handler+0x99
>ibbus!__process_cep+0x10f
>ibbus!__drep_handler+0x6ea
>ibbus!__cep_mad_recv_cb+0x246
>ibbus!__mad_svc_recv_done+0xb58
>ibbus!mad_disp_recv_done+0x1650
>ibbus!process_mad_recv+0x3bf
>ibbus!spl_qp_comp+0x3d2
>ibbus!spl_qp_recv_dpc_cb+0x112
>nt!KiRetireDpcList+0x117
>nt!KyRetireDpcList+0x5
>nt!KiDispatchInterruptContinue
>
>I've replaced the mutex with a spinlock - see below.
>I did the same for WinMad, although it has no asynchronous callbacks like
>WinVerbs.
>The main reason is to keep it similar to WinVerbs as it is today.
>A minor, mostly theoretical one: other functions currently use the
>provider mutex. It seems worthwhile to me to keep open the possibility
>for them to call the low-level WvProviderGet function.
>What's your opinion?
>
>Index: B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.c
>===================================================================
>--- B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.c (revision 9686)
>+++ B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.c (revision 9687)
>@@ -44,14 +44,15 @@
> LONG WvProviderGet(WV_PROVIDER *pProvider)
> {
>       LONG val;
>+      KIRQL irql;
>
>-      KeAcquireGuardedMutex(&pProvider->Lock);
>+      KeAcquireSpinLock(&pProvider->SpinLock, &irql);
>       val = InterlockedIncrement(&pProvider->Ref);
>       if (val == 1) {
>               pProvider->Ref = 0;
>               val = 0;
>       }
>-      KeReleaseGuardedMutex(&pProvider->Lock);
>+      KeReleaseSpinLock(&pProvider->SpinLock, irql);
>       return val;
> }
>
>@@ -119,6 +120,7 @@
>       KeInitializeEvent(&pProvider->SharedEvent, NotificationEvent, FALSE);
>       pProvider->Exclusive = 0;
>       KeInitializeEvent(&pProvider->ExclusiveEvent, SynchronizationEvent, FALSE);
>+      KeInitializeSpinLock(&pProvider->SpinLock);
>       return STATUS_SUCCESS;
> }
>
>Index: B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.h
>===================================================================
>--- B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.h (revision 9686)
>+++ B:/users/leonid/svn/winib/trunk/core/winverbs/kernel/wv_provider.h (revision 9687)
>@@ -80,6 +80,7 @@
>       KEVENT                  ExclusiveEvent;
>
>       WORK_QUEUE              WorkQueue;
>+      KSPIN_LOCK              SpinLock;
>
> }     WV_PROVIDER;
>
>Index: B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.h
>===================================================================
>--- B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.h (revision 9687)
>+++ B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.h (revision 9688)
>@@ -57,6 +57,7 @@
>       KEVENT                          SharedEvent;
>       LONG                            Exclusive;
>       KEVENT                          ExclusiveEvent;
>+      KSPIN_LOCK                      SpinLock;
>
> }     WM_PROVIDER;
>
>Index: B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.c
>===================================================================
>--- B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.c (revision 9687)
>+++ B:/users/leonid/svn/winib/trunk/core/winmad/kernel/wm_provider.c (revision 9688)
>@@ -36,14 +36,15 @@
> LONG WmProviderGet(WM_PROVIDER *pProvider)
> {
>       LONG val;
>+      KIRQL irql;
>
>-      KeAcquireGuardedMutex(&pProvider->Lock);
>+      KeAcquireSpinLock(&pProvider->SpinLock, &irql);
>       val = InterlockedIncrement(&pProvider->Ref);
>       if (val == 1) {
>               pProvider->Ref = 0;
>               val = 0;
>       }
>-      KeReleaseGuardedMutex(&pProvider->Lock);
>+      KeReleaseSpinLock(&pProvider->SpinLock, irql);
>       return val;
> }
>
>@@ -72,6 +73,7 @@
>       KeInitializeEvent(&pProvider->SharedEvent, NotificationEvent, FALSE);
>       pProvider->Exclusive = 0;
>       KeInitializeEvent(&pProvider->ExclusiveEvent, SynchronizationEvent, FALSE);
>+      KeInitializeSpinLock(&pProvider->SpinLock);
>
>       ASSERT(ControlDevice != NULL);
>
>
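For comparison, the "take a reference unless the count is already zero" logic of WvProviderGet/WmProviderGet can be modelled in user mode with C11 atomics. This is only a sketch under stated assumptions: `provider_t` and `provider_get` are hypothetical names, it is not what the patch above does, and it does not show whether the driver's lock also protects other state, so it should not be read as a drop-in replacement.

```c
#include <stdatomic.h>

/* Hypothetical user-mode stand-in for the provider's reference count. */
typedef struct {
    atomic_long ref;
} provider_t;

/* Take a reference only if the count is currently nonzero.
 * Returns the new count on success, or 0 if the count was already zero
 * (mirroring the return convention of the quoted WvProviderGet). */
static long provider_get(provider_t *p)
{
    long cur = atomic_load(&p->ref);

    while (cur != 0) {
        /* On failure, cur is reloaded with the current value and we retry. */
        if (atomic_compare_exchange_weak(&p->ref, &cur, cur + 1))
            return cur + 1;
    }
    return 0;
}
```

The compare-and-exchange loop never blocks, which is why this shape can run in contexts where waiting is not allowed. Whether that is sufficient depends on what else the lock serializes in the real driver.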
>-----Original Message-----
>From: Hefty, Sean [mailto:[email protected]]
>Sent: Tuesday, January 31, 2012 12:08 AM
>To: Leonid Keller; Tzachi Dar; Smith, Stan
>Cc: Uri Habusha; ofw_list; Irena Gannon
>Subject: RE: Opensm & WinMad: a race, cauing BSOD722
>
>> Two ideas:
>> WmProviderInit() is called without checking the return status. Is there a
>> reason ?
>> Seems like the similar patch is needed for WvIoDeviceControl().
>
>I can't tell whether IOCTLs suffer from the same problem or not.  But since 
>Windows is stupid, I went ahead and added the same protection
>to winverbs, plus some additional validation in case we get a cleanup event 
>for a file that we failed to create.
>
>
>
>
>- Sean
_______________________________________________
ofw mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw
