Hi Michael,
Thanks for continuing to dig into that. The patch looks great to me.
You can go ahead and apply it straight to trunk if you don't mind.
Thanks!
-Phil
Michael Moore wrote:
Attached is a patch that, so far, has resolved the cancel IO issues
we've been seeing.
In completion_list_retrieve_completed a cancelled IO operation gets the
base frame's user_ptr assigned to the user_ptr_array (which is the
vfs_request array used back in process_vfs_requests). This change stops
the segfaults in process_vfs_requests. Then, in PINT_client_io_cancel
the references to the contexts come from sm_base_p instead of sm_p. That
ensures the context_count is correct and the context structure has the
correct data. Without this change it's looking at a frame without this
data.
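
A minimal sketch of the idea described above, assuming a simplified frame stack. The names sm_stack, io_frame, top_frame, base_frame, contexts_to_cancel and completed_user_ptr are hypothetical and are not the actual PVFS2 structures or functions. It only illustrates why the cancel and completion paths should read from the base frame rather than from whatever frame a nested state machine has pushed on top:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real PVFS2 structures. */
enum { MAX_FRAMES = 4 };

struct io_frame {
    int context_count;   /* only meaningful on the I/O (base) frame */
    void *user_ptr;      /* caller's handle, e.g. a vfs_request */
};

struct sm_stack {
    struct io_frame frames[MAX_FRAMES];
    int depth;           /* number of frames currently pushed */
    int base_index;      /* index of the frame that owns the I/O state */
};

/* Frame that a nested state machine is currently running on. */
static struct io_frame *top_frame(struct sm_stack *s)
{
    assert(s != NULL && s->depth > 0);
    return &s->frames[s->depth - 1];
}

/* Frame pushed by the original I/O operation. */
static struct io_frame *base_frame(struct sm_stack *s)
{
    assert(s != NULL && s->base_index >= 0 && s->base_index < s->depth);
    return &s->frames[s->base_index];
}

/* Cancel path: read the context count from the base frame, not the top. */
static int contexts_to_cancel(struct sm_stack *s)
{
    return base_frame(s)->context_count;
}

/* Completion path: return the base frame's user_ptr for a cancelled op. */
static void *completed_user_ptr(struct sm_stack *s)
{
    return base_frame(s)->user_ptr;
}
```

With a second frame on the stack, top_frame() can return a frame whose context_count was never set for I/O, which matches the gibberish counts seen in the logs; base_frame() keeps returning the frame that carries the I/O data.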
Let me know if this looks okay. If so, can you apply it (or give me an
okay to apply it) to head?
Thanks,
Michael
On Wed, Feb 03, 2010 at 04:38:43PM -0500, Michael Moore wrote:
Hi Phil,
We're still seeing some issues around cancellation. One case I noticed,
but am finding hard to replicate, is when the sys-io state machine is in
the unstuff_xfer_msgpair state and has jumped to pvfs2_msgpairarray_sm.
For that state there will be a similar issue with a non I/O frame on the
stack, correct? In the cases I've seen, gibberish context counts get
printed, as in the log below, followed by a segfault when accessing
cur_ctx.
[D 15:51:00.658599] PINT_client_io_cancel id 7707
[D 15:51:00.658639] base frame is at index: -1
[D 15:51:00.658648] PINT_client_io_cancel: sm_p->u.io.context_count: 8958368
[D 15:51:00.658657] PINT_client_io_cancel: iteration i: 0
#0 PINT_client_io_cancel (id=7707)
at src/client/sysint/client-state-machine.c:548
#1 0x0804baf7 in service_operation_cancellation (vfs_request=0x85227e0)
at src/apps/kernel/linux/pvfs2-client-core.c:407
#2 0x0804f311 in handle_unexp_vfs_request (vfs_request=0x85227e0)
at src/apps/kernel/linux/pvfs2-client-core.c:2980
#3 0x08050f1f in process_vfs_requests ()
at src/apps/kernel/linux/pvfs2-client-core.c:3180
#4 0x080527a8 in main (argc=10, argv=0xbfa14434)
at src/apps/kernel/linux/pvfs2-client-core.c:3593
I notice there are jumps for io_getattr and io_datafile_size which would
put other frames on the stack. Should the code after the small io check
just use the base frame pointer instead of sm_p?
Thanks,
Michael
On Wed, Jan 20, 2010 at 08:01:41AM -0600, Phil Carns wrote:
Great! Thanks for testing it out.
-Phil
Michael Moore wrote:
Thanks Phil, that appears to solve the problem! I tested it both against
head and orange branch and didn't see any of the infinite looping or
client segfaults. I tested it without any of the other changes so it
looks like that patch alone resolves the issue.
Michael
On Fri, Jan 15, 2010 at 03:28:54PM -0500, Phil Carns wrote:
Hi Michael,
I just tried your test case on a clean trunk build here and was able to
reproduce the pvfs2-client-core segfault 100% of the time on my box.
The problem in a nutshell is that pvfs2-client-core was trying to cancel
a small-io operation using logic that is only appropriate for a normal
I/O operation, in turn causing memory corruption.
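
As a rough illustration of that failure mode, here is a sketch under stated assumptions. The io_op and io_ctx types, the is_small_io flag and cancel_io are hypothetical names, and the assumption that a small-io operation carries no per-server contexts is mine; this is not the code from the attached patch. The point is only that the cancel path has to branch before it walks any per-context state:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified types; not the real PVFS2 definitions. */
struct io_ctx {
    bool cancelled;
};

struct io_op {
    bool is_small_io;        /* assumed: small-io has no per-server contexts */
    int context_count;       /* valid only when is_small_io is false */
    struct io_ctx *contexts; /* valid only when is_small_io is false */
    bool cancel_requested;
};

/* Returns the number of per-server contexts that were cancelled. */
static int cancel_io(struct io_op *op)
{
    assert(op != NULL);
    op->cancel_requested = true;

    if (op->is_small_io) {
        /* Nothing to walk: touching contexts here would read garbage. */
        return 0;
    }

    for (int i = 0; i < op->context_count; i++) {
        op->contexts[i].cancelled = true;
    }
    return op->context_count;
}
```

Without the early return, a small-io operation would fall into the loop with an uninitialized context_count and contexts pointer, which is the kind of access that can corrupt memory.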
Can you try out the fix and see if it solves the problem for you? The
patch is attached, or you can pull it from cvs trunk.
You might want to try that change by itself (without the op purged
change) first and go from there. Some of the other issues you ran into
may have been an after-effect from the cancel problem.
-Phil
_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers