Thanks Phil, will do. There appears to be a memory leak of some of the internal state machine data structures when an IO op is cancelled; I'm still looking into that.
Michael

On Mon, Feb 08, 2010 at 11:32:42AM -0500, Phil Carns wrote:
> Hi Michael,
>
> Thanks for continuing to dig into that. The patch looks great to me.
>
> You can go ahead and apply it straight to trunk if you don't mind.
>
> thanks!
> -Phil
>
> Michael Moore wrote:
> > Attached is a patch that, so far, has resolved the cancel IO issues
> > we've been seeing.
> >
> > In completion_list_retrieve_completed a cancelled IO operation gets the
> > base frame's user_ptr assigned to the user_ptr_array (which is the
> > vfs_request array used back in process_vfs_request). This change stops
> > the segfaults in process_vfs_requests. Then, in PINT_client_io_cancel
> > the references to the contexts come from sm_base_p instead of sm_p. That
> > ensures the context_count is correct and the context structure has the
> > correct data. Without this change it's looking at a frame without this
> > data.
> >
> > Let me know if this looks okay; if so, can you apply it (or give me an
> > okay to apply it) to head?
> >
> > Thanks,
> > Michael
> >
> > On Wed, Feb 03, 2010 at 04:38:43PM -0500, Michael Moore wrote:
> >> Hi Phil,
> >>
> >> We're still seeing some issues around cancellation. One case I noticed,
> >> but am finding hard to replicate, is when the sys-io state machine is in
> >> the unstuff_xfer_msgpair state and has jumped to pvfs2_msgpairarray_sm.
> >> For that state there will be a similar issue with a non-I/O frame on the
> >> stack, correct? The cases I've seen are when gibberish context counts
> >> get printed, such as the ones below, followed by a segfault when
> >> accessing cur_ctx.
> >>
> >> [D 15:51:00.658599] PINT_client_io_cancel id 7707
> >> [D 15:51:00.658639] base frame is at index: -1
> >> [D 15:51:00.658648] PINT_client_io_cancel: sm_p->u.io.context_count: 8958368
> >> [D 15:51:00.658657] PINT_client_io_cancel: iteration i: 0
> >>
> >> #0  PINT_client_io_cancel (id=7707)
> >>     at src/client/sysint/client-state-machine.c:548
> >> #1  0x0804baf7 in service_operation_cancellation (vfs_request=0x85227e0)
> >>     at src/apps/kernel/linux/pvfs2-client-core.c:407
> >> #2  0x0804f311 in handle_unexp_vfs_request (vfs_request=0x85227e0)
> >>     at src/apps/kernel/linux/pvfs2-client-core.c:2980
> >> #3  0x08050f1f in process_vfs_requests ()
> >>     at src/apps/kernel/linux/pvfs2-client-core.c:3180
> >> #4  0x080527a8 in main (argc=10, argv=0xbfa14434)
> >>     at src/apps/kernel/linux/pvfs2-client-core.c:3593
> >>
> >> I notice there are jumps for io_getattr and io_datafile_size which would
> >> put other frames on the stack. Should the code after the small-io check
> >> just use the base frame pointer instead of sm_p?
> >>
> >> Thanks,
> >> Michael
> >>
> >> On Wed, Jan 20, 2010 at 08:01:41AM -0600, Phil Carns wrote:
> >>> Great! Thanks for testing it out.
> >>>
> >>> -Phil
> >>>
> >>> Michael Moore wrote:
> >>>> Thanks Phil, that appears to solve the problem! I tested it both against
> >>>> head and the orange branch and didn't see any of the infinite looping or
> >>>> client segfaults. I tested it without any of the other changes, so it
> >>>> looks like that patch alone resolves the issue.
> >>>>
> >>>> Michael
> >>>>
> >>>> On Fri, Jan 15, 2010 at 03:28:54PM -0500, Phil Carns wrote:
> >>>>> Hi Michael,
> >>>>>
> >>>>> I just tried your test case on a clean trunk build here and was able to
> >>>>> reproduce the pvfs2-client-core segfault 100% of the time on my box.
> >>>>>
> >>>>> The problem in a nutshell is that pvfs2-client-core was trying to cancel
> >>>>> a small-io operation using logic that is only appropriate for a normal
> >>>>> I/O operation, in turn causing some memory corruption.
> >>>>>
> >>>>> Can you try out the fix and see if it solves the problem for you? The
> >>>>> patch is attached, or you can pull it from cvs trunk.
> >>>>>
> >>>>> You might want to try that change by itself (without the op purged
> >>>>> change) first and go from there. Some of the other issues you ran into
> >>>>> may have been an after-effect of the cancel problem.
> >>>>>
> >>>>> -Phil
_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
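For readers following the discussion above, here is a minimal sketch of the idea behind the patch: when the sys-io state machine has jumped into nested machines (small-io, pvfs2_msgpairarray_sm, io_getattr, io_datafile_size), the current frame pointer (sm_p in the thread) is not the I/O frame, so the cancel path should locate the base frame (sm_base_p) and read context_count and the per-context data from it. The structures and the get_base_frame() helper below are hypothetical stand-ins, not the real PVFS2 definitions from client-state-machine.h; they only illustrate the base-frame-versus-current-frame distinction.

#include <stdio.h>

/* Hypothetical, simplified stand-ins for the PVFS2 client state machine
 * structures discussed in the thread.  The real definitions live in
 * src/client/sysint/ and differ from these. */
struct io_context
{
    int msg_tag;        /* illustrative per-context data */
    int cancelled;
};

struct client_sm_frame
{
    int is_io_frame;                /* nonzero only for the base I/O frame */
    int context_count;              /* valid only on the base I/O frame    */
    struct io_context contexts[8];
    void *user_ptr;                 /* the vfs_request pointer in the thread */
};

struct frame_stack
{
    struct client_sm_frame *frames[8];  /* base frame at index 0 */
    int frame_count;                    /* current stack depth   */
};

/* Return the base frame rather than the topmost (current) frame.  This
 * mirrors the patch's switch from sm_p to sm_base_p: nested jumps push
 * frames that do not carry the I/O context data. */
static struct client_sm_frame *get_base_frame(struct frame_stack *stack)
{
    return stack->frames[0];
}

/* Cancel path that reads context_count from the base frame.  Reading it
 * from the current (nested) frame instead is what produced the garbage
 * counts, e.g. 8958368, in the log excerpt above. */
static void cancel_io_contexts(struct frame_stack *stack)
{
    struct client_sm_frame *base = get_base_frame(stack);
    int i;

    for (i = 0; i < base->context_count; i++)
    {
        base->contexts[i].cancelled = 1;
        printf("cancelled context %d (tag %d)\n", i, base->contexts[i].msg_tag);
    }
}

int main(void)
{
    struct client_sm_frame base = { 1, 2, { { 100, 0 }, { 101, 0 } }, NULL };
    struct client_sm_frame nested = { 0, 0, { { 0, 0 } }, NULL };
    struct frame_stack stack = { { &base, &nested }, 2 };

    cancel_io_contexts(&stack);
    return 0;
}

The same reasoning applies to the completion path described in the patch: completion_list_retrieve_completed hands the base frame's user_ptr (the vfs_request) back through user_ptr_array, so process_vfs_requests dereferences a valid pointer instead of whatever the topmost nested frame happened to carry.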