Leo,

This patch, which I believe was committed as svn.4275, works fine if there are no IOUnits in the fabric. Once an IOU is present (in my case a Linux SRP target), this patch hangs HCA disable for a debug version of ibbus.

[AL]bus_release_resources(): Releasing BusFilter bfi-0
[AL]:al_cleanup(): Destroying \ device.
[AL]:al_cleanup(): Destroying AL Mgr.
[AL]sync_destroy_obj() !ERROR!: Error waiting for references to be released - delaying.
[AL]print_al_obj() !ERROR!: AL object 0000000082156200(AL_OBJ_TYPE_AL_MGR), parent: 0000000000000000 ref_cnt: 3
*** Assertion failed: cl_status == CL_SUCCESS
*** Source File: f:\openib-windows-svn\latest\gen1\trunk\core\al\al_common.c, line 554
Break repeatedly, break Once, Ignore, terminate Process, or terminate Thread (boipt)? i
i
[AL]sync_destroy_obj() !ERROR!: Forcing object destruction.
[AL]print_al_obj() !ERROR!: AL object 0000000082156200(AL_OBJ_TYPE_AL_MGR), parent: 0000000000000000 ref_cnt: 3
[AL]print_al_obj() !ERROR!: AL object 0000000082175270(AL_OBJ_TYPE_IOC_PNP_MGR), parent: 0000000082156200 ref_cnt: 1
[AL]print_al_obj() !ERROR!: AL object 00000000ff8ca2c0(AL_OBJ_TYPE_IOC_PNP_SVC), parent: 0000000082175270 ref_cnt: 2
[AL]print_al_obj() !ERROR!: AL object 0000000082175270(AL_OBJ_TYPE_IOC_PNP_MGR), parent: 0000000082156200 ref_cnt: 1
[AL]print_al_obj() !ERROR!: AL object 00000000ff8ca2c0(AL_OBJ_TYPE_IOC_PNP_SVC), parent: 0000000082175270 ref_cnt: 2
[AL]:al_cleanup(): Destroying async obj mgr.
[AL]:al_cleanup(): Destroying async pnp mgr.
[AL]:al_cleanup(): Destroying async proc mgr.
[AL]:al_cleanup(): Goodbye Cruel World =(
[AL]bus_release_resources() ]
Signaled to stop polling.
Polling thread terminated.

It seems there is a path in IBAL which is not releasing the reference on the IOC PnP service when an IOU is present in the fabric. Perhaps you could suggest a fix?
If commit svn.4275 is removed, the call to al_cleanup() returns successfully with no errors.

Thanks,

Stan.
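One generic way a reference can go unreleased, as hypothesized above, is a queued work item that returns early without dropping the reference taken on its behalf. The C sketch below illustrates only that pattern. All names in it (demo_obj_t, demo_ref, demo_deref, sweep_leaky, sweep_balanced, leftover_refs) are hypothetical and are not IBAL code, and nothing here establishes that this is the actual cause of the hang.

```c
#include <assert.h>

/* Hypothetical stand-in for a reference-counted object (not al_obj_t). */
typedef struct demo_obj {
    int ref_cnt;     /* outstanding references */
    int destroying;  /* non-zero once teardown has started */
} demo_obj_t;

static void demo_ref(demo_obj_t *o)
{
    o->ref_cnt++;
}

static void demo_deref(demo_obj_t *o)
{
    assert(o->ref_cnt > 0);
    o->ref_cnt--;
}

/* Work item that forgets its reference on the early-exit path. */
static void sweep_leaky(demo_obj_t *svc)
{
    if (svc->destroying)
        return;              /* reference taken at queue time is never dropped */
    /* ... normal processing would go here ... */
    demo_deref(svc);
}

/* Work item with a single exit that always drops the reference. */
static void sweep_balanced(demo_obj_t *svc)
{
    if (svc->destroying)
        goto done;
    /* ... normal processing would go here ... */
done:
    demo_deref(svc);
}

/*
 * Take a reference for the work item, start teardown before the work
 * item runs, then run it. Returns the number of references left over.
 */
static int leftover_refs(void (*sweep)(demo_obj_t *))
{
    demo_obj_t svc = { 0, 0 };

    demo_ref(&svc);       /* taken just before queuing the work item */
    svc.destroying = 1;   /* teardown begins while the item is still queued */
    sweep(&svc);          /* the queued item finally runs */
    return svc.ref_cnt;
}
```

In this model, leftover_refs(sweep_leaky) leaves one reference outstanding, which is the kind of state a synchronous destroy would wait on, while leftover_refs(sweep_balanced) returns to zero.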
________________________________
From: Leonid Keller [mailto:[email protected]]
Sent: Monday, April 27, 2009 5:38 AM
To: Leonid Keller; Fab Tillier; Smith, Stan
Cc: [email protected]
Subject: RE: [ofw] crash on IBBUS disabling while mad traffic

Here is a possible explanation and a fix. Please review.

__ioc_query_sa takes references on the IOC PnP service before sending the node and path_record requests.
But these references are released at the end of __node_rec_cb and __path_rec_cb, while the __process_sweep routine, which performs the IOU sweep, has only been scheduled to run in an async thread.
If the test happens to unload the driver after __node_rec_cb and __path_rec_cb have finished and before __process_sweep has started to run, the IOC PnP service is released and __process_sweep crashes.

The patch takes a reference on the IOC PnP service before scheduling a thread for __process_sweep and releases the reference at the end of __process_sweep.
(Note that __process_sweep schedules a thread for itself twice while moving through its FSM: SWEEP_IOU_INFO --> SWEEP_IOC_PROFILE --> SWEEP_SVC_ENTRIES --> SWEEP_COMPLETE.)

Index: al/kernel/al_ioc_pnp.c
===================================================================
--- al/kernel/al_ioc_pnp.c (revision 3609)
+++ al/kernel/al_ioc_pnp.c (working copy)
@@ -2231,8 +2231,11 @@
      * If this is the last MAD, finish processing the IOU queries
      * in the PnP thread.
      */
-    if( !cl_atomic_dec( &p_results->p_svc->query_cnt ) )
+    if( !cl_atomic_dec( &p_results->p_svc->query_cnt ) ) {
+        /* Reference the service till the end of processing in the thread */
+        ref_al_obj( &p_results->p_svc->obj );
         cl_async_proc_queue( gp_async_pnp_mgr, &p_results->async_item );
+    }
 
     AL_EXIT( AL_DBG_PNP );
 }
@@ -2354,6 +2357,8 @@
         if( !cl_atomic_dec( &gp_ioc_pnp->query_cnt ) )
             cl_async_proc_queue( gp_async_pnp_mgr, &gp_ioc_pnp->async_item );
         cl_free( p_results );
+        /* Release the reference taken for the query. */
+        deref_al_obj( &p_results->p_svc->obj );
     }
 
     AL_EXIT( AL_DBG_PNP );

________________________________
From: Leonid Keller
Sent: Sunday, April 26, 2009 1:05 AM
To: 'Fab Tillier'; 'Smith, Stan'
Cc: [email protected]
Subject: [ofw] crash on IBBUS disabling while mad traffic

I got a crash while running the WHQL Disable/Enable test while opensm was running on another node.
I was running a December version of the driver, but I'm not sure this will work with the current one. (I'll try.)

The test, which disables and re-enables all devices, passes without opensm.
With opensm running, IBBUS sends SA requests to it.
In this case __process_sweep() fails, because the per-port IOC PnP agent seems to have been released already.
The latter is strange, because __ioc_query_sa takes a reference on the PnP agent before sending the request.

__ioc_query_sa
__node_rec_cb
__process_query
__process_sweep

Any ideas?

3: kd> !analyze -v
ERROR: FindPlugIns 8007007b
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

DRIVER_PAGE_FAULT_IN_FREED_SPECIAL_POOL (d5)
Memory was referenced after it was freed.
This cannot be protected by try-except.
When possible, the guilty driver's name (Unicode string) is printed on
the bugcheck screen and saved in KiBugCheckDriver.
Arguments:
Arg1: fffff98005b72f84, memory referenced
Arg2: 0000000000000000, value 0 = read operation, 1 = write operation
Arg3: fffffa600400b1d0, if non-zero, the address which referenced memory.
Arg4: 0000000000000000, (reserved)

Debugging Details:
------------------

Matched: ibbus!proxy_ioctl+0x41 (fffffa60`04031d8d)
Matched: ibbus!proxy_ioctl+0xa5 (fffffa60`04031df1)

READ_ADDRESS: fffff98005b72f84 Special pool

FAULTING_IP:
ibbus!__process_sweep+44 [s:\builds\3609\branches\mlnx_winof_2-0\core\al\kernel\al_ioc_pnp.c @ 2315]
fffffa60`0400b1d0 83b8d400000003 cmp dword ptr [rax+0D4h],3

MM_INTERNAL_CODE: 0

IMAGE_NAME: ibbus.sys

DEBUG_FLR_IMAGE_TIMESTAMP: 49401b3e

MODULE_NAME: ibbus

FAULTING_MODULE: fffffa6004002000 ibbus

DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT

BUGCHECK_STR: 0xD5

PROCESS_NAME: System

CURRENT_IRQL: f

TRAP_FRAME: fffffa6003d50b00 -- (.trap 0xfffffa6003d50b00)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=fffff98005b72eb0 rbx=0000000000000000 rcx=fffffa6004057780
rdx=fffffa6004005e97 rsi=fffffa600199ccc0 rdi=fffff80001cc0304
rip=fffffa600400b1d0 rsp=fffffa6003d50c90 rbp=0000000000000080
r8=0000000000000005 r9=fffffa6004005e97 r10=0000000000000001
r11=fffffa6003d50c50 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0 nv up ei pl zr na po nc
ibbus!__process_sweep+0x44:
fffffa60`0400b1d0 83b8d400000003 cmp dword ptr [rax+0D4h],3 ds:fffff980`05b72f84=????????
Resetting default scope

LAST_CONTROL_TRANSFER: from fffff80001969c42 to fffff800018b0b30

STACK_TEXT:
fffffa60`03d502f8 fffff800`01969c42 : fffffa80`0e0eb290 fffff800`0194893d fffff800`01a55140 00000000`00001000 : nt!RtlpBreakWithStatusInstruction
fffffa60`03d50300 fffff800`0196adb7 : fffff800`00000004 fffff800`01a55140 ffffffff`fffff000 00000000`00000050 : nt!KiBugCheckDebugBreak+0x12
fffffa60`03d50360 fffff800`018b6754 : fffffa80`0dd77480 fffff800`01cc2bb9 00000000`00000000 fffff800`0194c13f : nt!KeBugCheck2+0xaa7
fffffa60`03d509d0 fffff800`018c5671 : 00000000`00000050 fffff980`05b72f84 00000000`00000000 fffffa60`03d50b00 : nt!KeBugCheckEx+0x104
fffffa60`03d50a10 fffff800`018b51d9 : 00000000`00000000 fffff980`0427cf78 fffffa80`0e0ecf00 fffff980`1c27ef40 : nt!MmAccessFault+0x1371
fffffa60`03d50b00 fffffa60`0400b1d0 : fffff980`1c27ef40 fffff980`04318e00 fffffa60`04005eba fffff980`04318e78 : nt!KiPageFault+0x119
fffffa60`03d50c90 fffffa60`04005e9d : fffff980`04318e98 fffff980`043bccb0 fffff980`1b88afd0 fffff980`04318e78 : ibbus!__process_sweep+0x44 [s:\builds\3609\branches\mlnx_winof_2-0\core\al\kernel\al_ioc_pnp.c @ 2315]
fffffa60`03d50cc0 fffffa60`040070d9 : fffff980`04318d60 fffff980`0434afd0 00000000`00000000 fffffa60`0400743c : ibbus!__cl_async_proc_worker+0x61 [s:\builds\3609\branches\mlnx_winof_2-0\core\complib\cl_async_proc.c @ 153]
fffffa60`03d50cf0 fffffa60`04007464 : fffff980`0434afd0 00000000`00000080 fffff980`0434afd0 8b8b8b8b`8b8b8b8b : ibbus!__cl_thread_pool_routine+0x41 [s:\builds\3609\branches\mlnx_winof_2-0\core\complib\cl_threadpool.c @ 66]
fffffa60`03d50d20 fffff800`01adafd3 : 8b8b8b8b`8b8b8b8b 8b8b8b8b`8b8b8b8b 8b8b8b8b`8b8b8b8b 8b8b8b8b`8b8b8b01 : ibbus!__thread_callback+0x28 [s:\builds\3609\branches\mlnx_winof_2-0\core\complib\kernel\cl_thread.c @ 49]
fffffa60`03d50d50 fffff800`018f0816 : fffffa60`01999180 fffffa80`0e0eb290 fffffa60`019a2d40 00000000`00000001 : nt!PspSystemThreadStartup+0x57
fffffa60`03d50d80 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x16

STACK_COMMAND: kb

FOLLOWUP_IP:
ibbus!__process_sweep+44 [s:\builds\3609\branches\mlnx_winof_2-0\core\al\kernel\al_ioc_pnp.c @ 2315]
fffffa60`0400b1d0 83b8d400000003 cmp dword ptr [rax+0D4h],3

FAULTING_SOURCE_CODE:
  2311:
  2312: p_results = PARENT_STRUCT( p_async_item, ioc_sweep_results_t, async_item );
  2313: CL_ASSERT( !p_results->p_svc->query_cnt );
  2314:
> 2315: if( p_results->p_svc->obj.state == CL_DESTROYING )
  2316: {
  2317:     __put_iou_map( gp_ioc_pnp, &p_results->iou_map );
  2318:     goto err;
  2319: }
  2320:

SYMBOL_STACK_INDEX: 6

SYMBOL_NAME: ibbus!__process_sweep+44

FOLLOWUP_NAME: MachineOwner

FAILURE_BUCKET_ID: X64_0xD5_VRF_ibbus!__process_sweep+44

BUCKET_ID: X64_0xD5_VRF_ibbus!__process_sweep+44

Followup: MachineOwner
---------
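The race discussed in this thread (the query callbacks drop their references, the driver unloads, and only then does the queued sweep run) can be modelled in a few lines of C. This is a single-threaded sketch under stated assumptions: demo_svc_t, svc_ref, svc_deref and worker_sees_freed are hypothetical names, the object sets a flag instead of really being freed so the outcome can be inspected safely, and the code does not reproduce the actual IBAL object model or threading.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-port service object. */
typedef struct demo_svc {
    int  ref_cnt;  /* outstanding references */
    bool freed;    /* set when the last reference is dropped */
} demo_svc_t;

static void svc_ref(demo_svc_t *s)
{
    s->ref_cnt++;
}

static void svc_deref(demo_svc_t *s)
{
    assert(s->ref_cnt > 0);
    if (--s->ref_cnt == 0)
        s->freed = true;   /* a real object would be destroyed here */
}

/*
 * Replays the sequence: query sent, work item queued, query callback
 * finishes, driver unloads, work item runs. Returns true if the work
 * item would have run against an already-freed object.
 */
static bool worker_sees_freed(bool hold_ref_for_worker)
{
    demo_svc_t svc = { 1, false };  /* one reference owned by the creator */
    bool stale;

    svc_ref(&svc);                  /* reference taken for the query */
    if (hold_ref_for_worker)
        svc_ref(&svc);              /* extra reference for the queued work item */
    /* the work item is queued at this point but has not run yet */
    svc_deref(&svc);                /* query callback ends and drops its reference */

    svc_deref(&svc);                /* unload: the creator drops its reference */

    stale = svc.freed;              /* the work item starts running now */
    if (hold_ref_for_worker)
        svc_deref(&svc);            /* work item releases its reference when done */
    return stale;
}
```

Without the extra reference the model reports a stale object, which corresponds to the freed special-pool read in the bugcheck above; with it, the object outlives the work item. Because __process_sweep is described as re-queuing itself between FSM states, a real fix would presumably need the reference to stay held across each re-queue, which this sketch does not model.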
Before polling.
create_port_mgr() !INFO!: created Port service obj 821AB1C0
create_iou_mgr() !INFO!: created IOU service obj 82135DA8

<2 HCAs are enabled>
'devcon disable =InfinibandController' command executed.

[AL]bus_release_resources(): Release Resources for bfi-0 CA guid 8025000002c90200
[AL]bus_release_resources(): ib_deregister bfi-0 CA guid 8025000002c90200
port_mgr_pnp_cb() [
port_mgr_pnp_cb() ]
port_mgr_pnp_cb() [
port_mgr_port_remove() [
bfi-0 ca_guid 0x8025000002c90200 port_num 1 port_mgr 821AB1C0
port_mgr_port_remove(): Mark removing IODEVICE: PDO FF765030, ext FF7650E8, present 0, missing 0 .
port_mgr_port_remove() ]
port_mgr_pnp_cb() ]
port_mgr_pnp_cb() [
port_mgr_port_remove() [
bfi-0 ca_guid 0x8025000002c90200 port_num 2 port_mgr 821AB1C0
port_mgr_port_remove(): Mark removing IODEVICE: PDO FF765DD8, ext FF765E90, present 0, missing 0 .
port_mgr_port_remove() ]
port_mgr_pnp_cb() ]
iou_mgr_pnp_cb() [
iou_mgr_iou_remove() [
bfi-0 ca_guid 0x8025000002c90200 iou_mgr 82135DA8
iou_mgr_iou_remove(): bfi-0 IB IOU: ext FF26A4D8, present 0, missing 1 .
iou_mgr_iou_remove() ]
iou_mgr_pnp_cb() ]
[AL]bus_release_resources(): del IOU_mgr *OBJ 82135DA8
destroying_iou_mgr() [
[AL]get_bfi_by_obj(): cl_obj 82135DA8 type IOU_MGR_OBJ --> bfi[0] F567C1E0
destroying_iou_mgr(): bfi-0 p_obj 82135DA8 iou_mgr 82135DA8 iou_mgr_obj 82135DA8
destroying_iou_mgr(): bfi-0 ib_dereg_pnp(IOU) IOU mgr 82181308
destroying_iou_mgr(): bfi-0 deref IOU service 82135DA8
destroying_iou_mgr() ]
free_iou_mgr() [
[AL]get_bfi_by_obj(): cl_obj 82135DA8 type IOU_MGR_OBJ --> bfi[0] F567C1E0
free_iou_mgr(): bfi-0 Mark all IOU PDOs as no longer present
free_iou_mgr(): *** SKIP deref h_ca 82565378 obj 82565378
free_iou_mgr(): Deleted device bfi-0 IB IOU: PDO FF26A420, ext FF26A4D8
free_iou_mgr(): DEL PDO FF26A420
free_iou_mgr(): bfi-0 Destroyed IOU svc 82135DA8
free_iou_mgr() ]
[AL]bus_release_resources(): del PORT_mgr OBJ 821AB1C0
destroying_port_mgr() [
[AL]get_bfi_by_obj(): cl_obj 821AB1C0 type PORT_MGR_OBJ --> bfi[0] F567C1E0
bfi-0 obj 821AB1C0 port_mgr 821AB1C0 port_mgr_obj 821AB1C0
destroying_port_mgr(): bfi-0 ib_dereg_pnp(PORT) PORT mgr 820D0220
destroying_port_mgr() ]
free_port_mgr() [
[AL]get_bfi_by_obj(): cl_obj 821AB1C0 type PORT_MGR_OBJ --> bfi[0] F567C1E0
bfi-0 obj 821AB1C0 port_mgr 821AB1C0 port_mgr_obj 821AB1C0
bfi-0 Mark all IPoIB PDOs no longer present
free_port_mgr(): bfi-0 Deleted device IODEVICE: PDO FF765030, ext FF7650E8
free_port_mgr(): bfi-0 Deleted device IODEVICE: PDO FF765DD8, ext FF765E90
free_port_mgr() ]
[AL]bus_release_resources(): Releasing BusFilter bfi-0
[AL]:al_cleanup(): Destroying \ device.
[AL]:al_cleanup(): Destroying AL Mgr.
[AL]sync_destroy_obj() !ERROR!: Error waiting for references to be released - delaying.
[AL]print_al_obj() !ERROR!: AL object 0000000082156200(AL_OBJ_TYPE_AL_MGR), parent: 0000000000000000 ref_cnt: 3
*** Assertion failed: cl_status == CL_SUCCESS
*** Source File: f:\openib-windows-svn\latest\gen1\trunk\core\al\al_common.c, line 554
Break repeatedly, break Once, Ignore, terminate Process, or terminate Thread (boipt)? i
i
[AL]sync_destroy_obj() !ERROR!: Forcing object destruction.
[AL]print_al_obj() !ERROR!: AL object 0000000082156200(AL_OBJ_TYPE_AL_MGR), parent: 0000000000000000 ref_cnt: 3
[AL]print_al_obj() !ERROR!: AL object 0000000082175270(AL_OBJ_TYPE_IOC_PNP_MGR), parent: 0000000082156200 ref_cnt: 1
[AL]print_al_obj() !ERROR!: AL object 00000000ff8ca2c0(AL_OBJ_TYPE_IOC_PNP_SVC), parent: 0000000082175270 ref_cnt: 2
[AL]print_al_obj() !ERROR!: AL object 0000000082175270(AL_OBJ_TYPE_IOC_PNP_MGR), parent: 0000000082156200 ref_cnt: 1
[AL]print_al_obj() !ERROR!: AL object 00000000ff8ca2c0(AL_OBJ_TYPE_IOC_PNP_SVC), parent: 0000000082175270 ref_cnt: 2
[AL]:al_cleanup(): Destroying async obj mgr.
[AL]:al_cleanup(): Destroying async pnp mgr.
[AL]:al_cleanup(): Destroying async proc mgr.
[AL]:al_cleanup(): Goodbye Cruel World =(
[AL]bus_release_resources() ]
Signaled to stop polling.
Polling thread terminated.
