Thanks Phil, will do. There appears to be a memory leak of some of the 
internal state-machine data structures when an I/O op is cancelled; I'm 
still looking into that.

Michael

On Mon, Feb 08, 2010 at 11:32:42AM -0500, Phil Carns wrote:
> Hi Michael,
> 
> Thanks for continuing to dig into that.  The patch looks great to me.
> 
> You can go ahead and apply it straight to trunk if you don't mind.
> 
> thanks!
> -Phil
> 
> Michael Moore wrote:
> > Attached is a patch that, so far, has resolved the cancel IO issues 
> > we've been seeing. 
> > 
> > In completion_list_retrieve_completed, a cancelled I/O operation now gets 
> > the base frame's user_ptr assigned to user_ptr_array (the vfs_request 
> > array used back in process_vfs_requests). That stops the segfaults in 
> > process_vfs_requests. Then, in PINT_client_io_cancel, the context 
> > references come from sm_base_p instead of sm_p, which ensures that 
> > context_count is correct and that the context structure holds the right 
> > data. Without that change it was looking at a frame that doesn't carry 
> > this data.
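> > 
> > To make those two pieces concrete, here's a tiny self-contained sketch of 
> > the idea. The struct and field names below (io_frame, sm_stack, base, and 
> > so on) are simplified stand-ins for illustration, not the actual PVFS2 
> > state machine types:
> > 
> > #include <stdio.h>
> > 
> > /* Simplified stand-ins -- not the real PVFS2 structures. */
> > struct io_frame {
> >     void *user_ptr;       /* vfs_request pointer handed back to the caller */
> >     int   context_count;  /* only meaningful on the frame that started the I/O */
> > };
> > 
> > struct sm_stack {
> >     struct io_frame *frames[8];  /* nested state machines push frames on top */
> >     int top;                     /* currently executing frame */
> >     int base;                    /* frame holding the original I/O state */
> > };
> > 
> > /* The fix in spirit: read I/O state from the base frame, not from
> >  * whatever frame happens to be on top of the stack when cancel runs. */
> > static struct io_frame *base_frame(struct sm_stack *sm)
> > {
> >     return sm->frames[sm->base];
> > }
> > 
> > int main(void)
> > {
> >     struct io_frame io  = { .user_ptr = (void *)&io, .context_count = 2 };
> >     struct io_frame msg = { 0 };  /* e.g. a pushed msgpairarray frame */
> >     struct sm_stack sm  = { .frames = { &io, &msg }, .top = 1, .base = 0 };
> > 
> >     /* sm.frames[sm.top]->context_count would be the garbage value we saw */
> >     printf("context_count = %d\n", base_frame(&sm)->context_count);
> >     return 0;
> > }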
> > 
> > Let me know if this looks okay. If so, can you apply it (or give me the 
> > okay to apply it) to head?
> > 
> > Thanks,
> > Michael
> > 
> > On Wed, Feb 03, 2010 at 04:38:43PM -0500, Michael Moore wrote:
> >> Hi Phil,
> >>
> >> We're still seeing some issues around cancellation. One case I noticed, 
> >> but am finding hard to replicate, is when the sys-io state machine is in 
> >> the unstuff_xfer_msgpair state and has jumped to pvfs2_msgpairarray_sm. 
> >> For that state there will be a similar issue with a non-I/O frame on the 
> >> stack, correct? The cases I've seen print gibberish context counts such 
> >> as the output below, followed by a segfault when accessing cur_ctx.
> >>
> >> [D 15:51:00.658599] PINT_client_io_cancel id 7707
> >> [D 15:51:00.658639] base frame is at index: -1
> >> [D 15:51:00.658648] PINT_client_io_cancel: sm_p->u.io.context_count: 8958368
> >> [D 15:51:00.658657] PINT_client_io_cancel: iteration i: 0
> >>
> >> #0  PINT_client_io_cancel (id=7707) 
> >>     at src/client/sysint/client-state-machine.c:548
> >> #1  0x0804baf7 in service_operation_cancellation (vfs_request=0x85227e0) 
> >>     at src/apps/kernel/linux/pvfs2-client-core.c:407
> >> #2  0x0804f311 in handle_unexp_vfs_request (vfs_request=0x85227e0) 
> >>     at src/apps/kernel/linux/pvfs2-client-core.c:2980
> >> #3  0x08050f1f in process_vfs_requests ()
> >>     at src/apps/kernel/linux/pvfs2-client-core.c:3180
> >> #4  0x080527a8 in main (argc=10, argv=0xbfa14434)
> >>     at src/apps/kernel/linux/pvfs2-client-core.c:3593
> >>
> >> I notice there are jumps for io_getattr and io_datafile_size, which would 
> >> put other frames on the stack. Should the code after the small-io check 
> >> just use the base frame pointer instead of sm_p? 
> >>
> >> Thanks,
> >> Michael
> >>
> >> On Wed, Jan 20, 2010 at 08:01:41AM -0600, Phil Carns wrote:
> >>> Great!  Thanks for testing it out.
> >>>
> >>> -Phil
> >>>
> >>> Michael Moore wrote:
> >>>> Thanks Phil, that appears to solve the problem! I tested it against both 
> >>>> head and the orange branch and didn't see any of the infinite looping or 
> >>>> client segfaults. I tested it without any of the other changes, so it 
> >>>> looks like that patch alone resolves the issue.
> >>>>
> >>>> Michael
> >>>>
> >>>> On Fri, Jan 15, 2010 at 03:28:54PM -0500, Phil Carns wrote:
> >>>>> Hi Michael,
> >>>>>
> >>>>> I just tried your test case on a clean trunk build here and was able to 
> >>>>> reproduce the pvfs2-client-core segfault 100% of the time on my box.
> >>>>>
> >>>>> The problem in a nutshell is that pvfs2-client-core was trying to cancel 
> >>>>> a small-io operation using logic that is only appropriate for a normal 
> >>>>> I/O operation, which in turn caused some memory corruption.
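> >>>>>
> >>>>> Roughly, the shape of the problem looks like the sketch below. The 
> >>>>> names here (op_kind, cancel_normal_io, cancel_small_io) are invented 
> >>>>> for illustration and aren't the actual pvfs2-client-core symbols:
> >>>>>
> >>>>> /* Illustration only -- invented types and names, not the real code. */
> >>>>> enum op_kind { OP_NORMAL_IO, OP_SMALL_IO };
> >>>>>
> >>>>> struct client_op {
> >>>>>     enum op_kind kind;
> >>>>>     int context_count;   /* flow contexts exist only for normal I/O */
> >>>>> };
> >>>>>
> >>>>> static int cancel_normal_io(struct client_op *op)
> >>>>> {
> >>>>>     /* walk and cancel op->context_count flow contexts ... */
> >>>>>     (void)op;
> >>>>>     return 0;
> >>>>> }
> >>>>>
> >>>>> static int cancel_small_io(struct client_op *op)
> >>>>> {
> >>>>>     /* small-io is a single msgpair; there are no flow contexts, so
> >>>>>      * running the normal-I/O teardown here walks memory that was
> >>>>>      * never set up -- hence the corruption. */
> >>>>>     (void)op;
> >>>>>     return 0;
> >>>>> }
> >>>>>
> >>>>> int cancel_op(struct client_op *op)
> >>>>> {
> >>>>>     return (op->kind == OP_SMALL_IO) ? cancel_small_io(op)
> >>>>>                                      : cancel_normal_io(op);
> >>>>> }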
> >>>>>
> >>>>> Can you try out the fix and see if it solves the problem for you?  The 
> >>>>> patch is attached, or you can pull it from CVS trunk.
> >>>>>
> >>>>> You might want to try that change by itself (without the op purged 
> >>>>> change) first and go from there.  Some of the other issues you ran into 
> >>>>> may have been an after-effect of the cancel problem.
> >>>>>
> >>>>> -Phil
> 
_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
