Hi Michael,

Thanks for continuing to dig into that.  The patch looks great to me.

You can go ahead and apply it straight to trunk if you don't mind.

thanks!
-Phil

Michael Moore wrote:
Attached is a patch that, so far, has resolved the cancel I/O issues we've been seeing. It makes two changes. First, in completion_list_retrieve_completed, a cancelled I/O operation now gets the base frame's user_ptr assigned to the user_ptr_array (which is the vfs_request array used back in process_vfs_requests). This change stops the segfaults in process_vfs_requests. Second, in PINT_client_io_cancel, the references to the contexts now come from sm_base_p instead of sm_p. That ensures the context_count is correct and the context structure has the correct data. Without this change, the code looks at a frame that does not hold this data.
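
To illustrate the second change, here is a minimal sketch of why reading from the base frame matters. It is a toy model, not the PVFS2 code: the toy_frame and toy_sm types and the toy_* functions are hypothetical names invented for this example, and the only assumption carried over from the description above is that the base (bottom) frame is the one holding valid context data while a nested state machine has pushed another frame on top.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_FRAMES   4
#define TOY_MAX_CONTEXTS 8

/* Hypothetical frame: in this toy model only the base frame
 * carries a meaningful context_count. */
struct toy_frame {
    int context_count;
    int cancelled[TOY_MAX_CONTEXTS];
};

/* Hypothetical state machine with a small frame stack.
 * frames[0] is the base frame, frames[depth - 1] is the current one. */
struct toy_sm {
    struct toy_frame frames[TOY_MAX_FRAMES];
    int depth;
};

/* Bottom of the stack: the frame the I/O operation started with. */
static struct toy_frame *toy_base_frame(struct toy_sm *sm)
{
    assert(sm != NULL && sm->depth > 0 && sm->depth <= TOY_MAX_FRAMES);
    return &sm->frames[0];
}

/* Top of the stack: may belong to a nested state machine. */
static struct toy_frame *toy_current_frame(struct toy_sm *sm)
{
    assert(sm != NULL && sm->depth > 0 && sm->depth <= TOY_MAX_FRAMES);
    return &sm->frames[sm->depth - 1];
}

/* Mark every context as cancelled, always reading the count from the
 * base frame. Returns the number of contexts cancelled. */
static int toy_cancel(struct toy_sm *sm)
{
    struct toy_frame *base = toy_base_frame(sm);
    int i;

    assert(base->context_count >= 0 &&
           base->context_count <= TOY_MAX_CONTEXTS);
    for (i = 0; i < base->context_count; i++) {
        base->cancelled[i] = 1;
    }
    return base->context_count;
}
```

With two frames on the stack, toy_cancel still walks the base frame's contexts even if the current frame holds an unrelated value in the same field, which is the situation the gibberish context counts point to.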

Let me know if this looks okay. If so, can you apply it (or give me an okay to apply it) to head?

Thanks,
Michael

On Wed, Feb 03, 2010 at 04:38:43PM -0500, Michael Moore wrote:
Hi Phil,

We're still seeing some issues around cancellation. One case I noticed, but am finding hard to replicate, is when the sys-io state machine is in the unstuff_xfer_msgpair state and has jumped to pvfs2_msgpairarray_sm. For that state there will be a similar issue with a non-I/O frame on the stack, correct? In the cases I've seen, gibberish context counts get printed, such as those below, followed by a segfault when accessing cur_ctx.

[D 15:51:00.658599] PINT_client_io_cancel id 7707
[D 15:51:00.658639] base frame is at index: -1
[D 15:51:00.658648] PINT_client_io_cancel: sm_p->u.io.context_count: 8958368
[D 15:51:00.658657] PINT_client_io_cancel: iteration i: 0

#0  PINT_client_io_cancel (id=7707)
    at src/client/sysint/client-state-machine.c:548
#1  0x0804baf7 in service_operation_cancellation (vfs_request=0x85227e0)
    at src/apps/kernel/linux/pvfs2-client-core.c:407
#2  0x0804f311 in handle_unexp_vfs_request (vfs_request=0x85227e0)
    at src/apps/kernel/linux/pvfs2-client-core.c:2980
#3  0x08050f1f in process_vfs_requests ()
    at src/apps/kernel/linux/pvfs2-client-core.c:3180
#4  0x080527a8 in main (argc=10, argv=0xbfa14434)
    at src/apps/kernel/linux/pvfs2-client-core.c:3593

I notice there are jumps for io_getattr and io_datafile_size, which would put other frames on the stack. Should the code after the small I/O check just use the base frame pointer instead of sm_p?

Thanks,
Michael

On Wed, Jan 20, 2010 at 08:01:41AM -0600, Phil Carns wrote:
Great!  Thanks for testing it out.

-Phil

Michael Moore wrote:
Thanks Phil, that appears to solve the problem! I tested it both against head and orange branch and didn't see any of the infinite looping or client segfaults. I tested it without any of the other changes so it looks like that patch alone resolves the issue.

Michael

On Fri, Jan 15, 2010 at 03:28:54PM -0500, Phil Carns wrote:
Hi Michael,

I just tried your test case on a clean trunk build here and was able to reproduce the pvfs2-client-core segfault 100% of the time on my box.

The problem in a nutshell is that pvfs2-client-core was trying to cancel a small-io operation using logic that is only appropriate for a normal I/O operation, in turn causing some memory corruptions.
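
As a rough illustration of that failure mode (a sketch only, not the actual fix): the toy_io_op type and toy_io_cancel function below are hypothetical names, and the assumption is simply that a small-io operation has no per-context state, so the cancel path must check the operation kind before walking any contexts.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operation kinds for this toy example. */
enum toy_io_kind { TOY_SMALL_IO, TOY_NORMAL_IO };

struct toy_io_op {
    enum toy_io_kind kind;
    int context_count;    /* meaningful only for TOY_NORMAL_IO */
    int cancel_requested;
};

/* Request cancellation. Returns how many per-context cancellations
 * would be issued; a small-io operation has none to walk. */
static int toy_io_cancel(struct toy_io_op *op)
{
    assert(op != NULL);
    op->cancel_requested = 1;

    if (op->kind == TOY_SMALL_IO) {
        /* Do not touch context_count here: it was never set up. */
        return 0;
    }
    return op->context_count;
}
```

Without the kind check, the small-io case would fall through and treat whatever sits in context_count as a loop bound.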

Can you try out the fix and see if it solves the problem for you? The patch is attached, or you can pull it from cvs trunk.

You might want to try that change by itself (without the op purged change) first and go from there. Some of the other issues you ran into may have been an after-effect from the cancel problem.

-Phil
_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers

