Hi Michael,

I just tried your test case on a clean trunk build here and was able to reproduce the pvfs2-client-core segfault 100% of the time on my box.

The problem in a nutshell is that pvfs2-client-core was trying to cancel a small-io operation using logic that is only appropriate for a normal I/O operation, which in turn caused memory corruption.

Can you try out the fix and see if it solves the problem for you? The patch is attached, or you can pull it from CVS trunk.

You might want to try that change by itself (without the op_state_purged change) first and go from there. Some of the other issues you ran into may have been after-effects of the cancel problem.

-Phil


Michael Moore wrote:
On Thu, Jan 07, 2010 at 06:48:19PM -0500, Michael Moore wrote:
On Thu, Jan 07, 2010 at 03:07:17PM -0600, Sam Lang wrote:
On Jan 7, 2010, at 2:29 PM, Michael Moore wrote:

On Thu, Jan 07, 2010 at 12:30:00PM -0600, Sam Lang wrote:
On Jan 7, 2010, at 11:32 AM, Michael Moore wrote:

In continuing to look at the 100% CPU usage (kernel loop) Randy had written about previously, I've narrowed the issue down a little. It appears related to cancellation of operations when a write() call is blocking and I/O has been retried. While on our cluster the retries were caused by congestion, I am re-creating the situation by killing an I/O server. The test C program I'm using just loops around 4k writes to a PVFS file. If, while the program is executing, I kill a PVFS I/O server, the write hangs (as expected). About 30% of the time when I try to kill the process doing the writing, it spikes to 100% CPU usage and is not killable. Also, every time I try to kill the writing process, pvfs2-client-core segfaults with something similar to:

[E 11:58:09.724121] PVFS2 client: signal 11, faulty address is 0x41ec, from 0x8050b51
[E 11:58:09.725403] [bt] pvfs2-client-core [0x8050b51]
[E 11:58:09.725427] [bt] pvfs2-client-core(main+0xe48) [0x8052498]
[E 11:58:09.725436] [bt] /lib/libc.so.6(__libc_start_main+0xdc) [0x75ee9c]
[E 11:58:09.725444] [bt] pvfs2-client-core [0x804a381]
[E 11:58:09.740133] Child process with pid 2555 was killed by an uncaught signal 6

In the cases where CPU usage hits 100% (and the process can't be terminated), the for() loop in PINT_client_io_cancel strangely segfaults on exactly iteration 31. The value of sm_p->u.io.context_count is in the hundreds, so there are a significant number of jobs left to cancel.
Hi Michael,

Are you guys using infiniband by chance?  Do you have a stack trace with 
debugging symbols where the pvfs2-client-core segfault occurs?  That might be 
useful for narrowing things down.

No InfiniBand in use, vanilla Ethernet.

Stack trace with symbols:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb7fe06d0 (LWP 2069)]
PINT_client_io_cancel (id=5594) at src/client/sysint/client-state-machine.c:524
524             if (cur_ctx->msg_send_in_progress)
(gdb) bt
#0  PINT_client_io_cancel (id=5594)
   at src/client/sysint/client-state-machine.c:524
#1  0x0804b926 in service_operation_cancellation (vfs_request=0xa073d98)
   at src/apps/kernel/linux/pvfs2-client-core.c:407
#2  0x0804f141 in handle_unexp_vfs_request (vfs_request=0xa073d98)
   at src/apps/kernel/linux/pvfs2-client-core.c:2959
#3  0x08050c7b in process_vfs_requests ()
   at src/apps/kernel/linux/pvfs2-client-core.c:3148
#4  0x08052498 in main (argc=12, argv=0xbfe0d224)
   at src/apps/kernel/linux/pvfs2-client-core.c:3561
(gdb)
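
For context, line 524 sits inside the per-context cancellation loop in PINT_client_io_cancel(). Roughly, the loop has the following shape (a paraphrased sketch based on the backtrace and debug logging, not the exact source; aside from context_count and msg_send_in_progress, the field names are guesses):

    /* Paraphrased sketch of the loop around client-state-machine.c:524;
     * field names other than context_count and msg_send_in_progress are
     * guesses, not the exact source. */
    for (i = 0; i < sm_p->u.io.context_count; i++)
    {
        cur_ctx = &sm_p->u.io.contexts[i];

        if (cur_ctx->msg_send_in_progress)       /* line 524: faulting access */
        {
            /* post cancellation: BMI Send (Request) */
        }
        if (cur_ctx->msg_recv_in_progress)
        {
            /* post cancellation: BMI Recv (Response) */
        }
        if (cur_ctx->flow_in_progress)
        {
            /* post cancellation: FLOW */
        }
        if (cur_ctx->write_ack_in_progress)
        {
            /* post cancellation: BMI Recv (Write Ack) */
        }
    }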

In addition to the above segfault, I'm also seeing one on vfs_request in process_vfs_requests. The vfs_request segfault is the one shown in the attached logs; it occurs at the first point where the function accesses vfs_request:
Program received signal SIGSEGV, Segmentation fault.
0x08050cd7 in process_vfs_requests ()
    at src/apps/kernel/linux/pvfs2-client-core.c:3125
3125                if (vfs_request->num_ops == 1 &&

Again, these are Orange-Branch CVS line numbers. As a side note, I get a configure error when trying to build the head version:
"config.status: error: cannot find input file: src/io/bmi/bmi_zoid/module.mk.in"

It's obvious that vfs_request is not a valid address, but I haven't figured out where it's getting mucked with or thrown away. Note that these logs are from after making the op_state_purged(op) change recommended below. That change seems to have prevented the kernel infinite-loop issue, as you indicated it would.

I don't think I'm doing anything special from the application side, but I'm also attaching the application I'm using to reproduce this issue.
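
For reference, it does little more than the following (a minimal sketch of the write loop, not the exact attached program; the mount path is a placeholder):

    /* Minimal sketch: loop writing 4k buffers to a file on a PVFS2 mount.
     * Not the exact attached test program; /mnt/pvfs2 is a placeholder. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : "/mnt/pvfs2/testfile";
        char buf[4096];
        int fd;

        memset(buf, '?', sizeof(buf));

        fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0)
        {
            perror("open");
            return 1;
        }

        /* write 4k at a time until killed or an error occurs */
        for (;;)
        {
            if (write(fd, buf, sizeof(buf)) < 0)
            {
                perror("write");
                break;
            }
        }

        close(fd);
        return 0;
    }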

Thanks,
Michael

What version of PVFS is this?  I'm having trouble matching up the lines to 
anything interesting.
-sam

Sorry, Orange-Branch out of CVS.

Michael


The real issue is the 30% of the time when the process gets stuck in the kernel waiting for a downcall. With some additional debugging, I can see that the process's write() call is stuck in the while(1) loop in wait_for_cancellation_downcall(). The function's assumption is that the request will either time out or be serviced after one iteration of the loop. However, in this situation neither occurs: the schedule_timeout() call returns immediately because a signal is pending, but the op is never serviced, so it spins indefinitely.
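
Schematically, the loop looks something like this (a simplified sketch, not the exact source; the timeout value is a stand-in):

    /* Simplified sketch of the wait loop in wait_for_cancellation_downcall(). */
    while (1)
    {
        set_current_state(TASK_INTERRUPTIBLE);

        if (op_state_serviced(op))
        {
            /* never true here: pvfs2-client-core segfaulted, so nothing
             * will ever service this op */
            ret = 0;
            break;
        }

        /* A signal is pending (the kill), so schedule_timeout() returns
         * immediately with jiffies still remaining; the timeout branch
         * below is never taken and the loop spins at 100% CPU. */
        if (!schedule_timeout(op_timeout))   /* op_timeout: stand-in name */
        {
            ret = -ETIMEDOUT;
            break;
        }
    }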
Those infinite looping conditionals have bugged me for a while now (there's one 
in wait_for_matching_downcall too).  We should probably go through and 
systematically replace all of them.  Let's try to fix your bug first, though.

I'm actually surprised that the process goes to 100% CPU in this case.  You're 
right that the op is not getting serviced, so the first if conditional won't 
break out of the while loop.  But schedule_timeout should only return 
non-zero if the task gets woken up, and it only gets woken up in 
purge_waiting_ops() when pvfs2-client-core segfaults.  And that should only 
happen once.  One thing you might try is to change the first if conditional 
from:

       if (op_state_serviced(op))

to:

       if (op_state_serviced(op) || op_state_purged(op))

That will allow purged ops to exit the while loop.  Could you share your 
debugging output and modified code?
No special code other than some gossip_debug calls. The kernel debugging 
output right before and including the cancel is:

Jan  7 15:15:25 vm04 kernel: Alloced OP (d0c48020: 875 OP_FILE_IO)
Jan  7 15:15:25 vm04 kernel: GET op d0c48020 -> buffer_index 0
Jan  7 15:15:25 vm04 kernel: pvfs2_file_write: copy_to_user 1 nr_segs 1, offset: 3129344 total_size: 4096
Jan  7 15:15:25 vm04 kernel: pvfs_bufmap_copy_iovec_from_user: index 0, size 4096
Jan  7 15:15:25 vm04 kernel: First character (integer value) in pvfs_bufmap_copy_from_user: 63
Jan  7 15:15:25 vm04 kernel: pvfs2: service_operation: pvfs2_file_write d0c48020
Jan  7 15:15:25 vm04 kernel: pvfs2: operation posted by process: pvfs_rand_write, pid: 2076
Jan  7 15:15:25 vm04 kernel: Releasing OP (d0c98060: 874)
Jan  7 15:15:25 vm04 kernel: client-core: reading op tag 875 OP_FILE_IO
Jan  7 15:15:55 vm04 kernel: *** operation interrupted by a signal (tag 875, op d0c48020)
Jan  7 15:15:55 vm04 kernel: Interrupted: Removed op d0c48020 from htable_ops_in_progress
Jan  7 15:15:55 vm04 kernel: pvfs2: service_operation pvfs2_file_write returning: -4 for d0c48020.
Jan  7 15:15:55 vm04 kernel: pvfs2_cancel_op_in_progress called on tag 875
Jan  7 15:15:55 vm04 kernel: Alloced OP (da5000a0: 876 OP_CANCEL)
Jan  7 15:15:55 vm04 kernel: Attempting PVFS2 operation cancellation of tag 875
Jan  7 15:15:55 vm04 kernel: pvfs2: service_operation: pvfs2_cancel da5000a0
Jan  7 15:15:55 vm04 kernel: pvfs2: operation posted by process: pvfs_rand_write, pid: 2076
Jan  7 15:15:55 vm04 kernel: *** operation didn't time out: da5000a0, jiffies_left: 20000
Jan  7 15:15:57 vm04 last message repeated 3 times
Jan  7 15:15:58 vm04 kernel: *** operation didn't time out: da5000a0, jiffies_left: 19999
Jan  7 15:15:58 vm04 kernel: *** operation didn't time out: da5000a0, jiffies_left: 20000
Jan  7 15:15:58 vm04 last message repeated 59 times
............


The pvfs2-client debug log (from an earlier failure) proves my earlier statement about segfaulting in the 31st iteration false. This one goes into the 70s.

[D 14:35:53.231333] cancel_op_in_progress called
[D 14:35:53.231342] cancelling I/O req 0x88c7f98 from tag 1128
[D 14:35:53.231357] PINT_client_io_cancel id 7350
[D 14:35:53.231368] job iteration [0]
[D 14:35:53.231377] [0] Posting cancellation of type: BMI Send (Request)
[D 14:35:53.231386] [0] Posting cancellation of type: BMI Recv (Response)
[D 14:35:53.231396] [0] Posting cancellation of type: FLOW
[D 14:35:53.231431] [0] Posting cancellation of type: BMI Recv (Write Ack)
[D 14:35:53.231444] job iteration [1]
[D 14:35:53.231452] [1] Posting cancellation of type: BMI Send (Request)
[D 14:35:53.231461] [1] Posting cancellation of type: BMI Recv (Response)
[D 14:35:53.231469] [1] Posting cancellation of type: BMI Recv (Write Ack)
[D 14:35:53.231502] job iteration [2]
[D 14:35:53.231510] [2] Posting cancellation of type: BMI Send (Request)
[D 14:35:53.231519] [2] Posting cancellation of type: BMI Recv (Response)
[D 14:35:53.231528] [2] Posting cancellation of type: FLOW
[D 14:35:53.231536] [2] Posting cancellation of type: BMI Recv (Write Ack)
[D 14:35:53.231545] job iteration [3]
[D 14:35:53.231553] [3] Posting cancellation of type: BMI Send (Request)
[D 14:35:53.231562] [3] Posting cancellation of type: BMI Recv (Response)
[D 14:35:53.231571] [3] Posting cancellation of type: FLOW
[D 14:35:53.231579] [3] Posting cancellation of type: BMI Recv (Write Ack)
[D 14:35:53.231588] job iteration [4]
[D 14:35:53.231596] [4] Posting cancellation of type: BMI Send (Request)
[D 14:35:53.231605] [4] Posting cancellation of type: BMI Recv (Response)
[D 14:35:53.231614] [4] Posting cancellation of type: FLOW
[D 14:35:53.231622] [4] Posting cancellation of type: BMI Recv (Write Ack)
[D 14:35:53.231628] job iteration [5]
[D 14:35:53.231666] [5] Posting cancellation of type: BMI Send (Request)
[D 14:35:53.231675] [5] Posting cancellation of type: BMI Recv (Response)
[D 14:35:53.231684] [5] Posting cancellation of type: FLOW
[D 14:35:53.231692] [5] Posting cancellation of type: BMI Recv (Write Ack)
[D 14:35:53.231701] job iteration [6]
[D 14:35:53.231709] [6] Posting cancellation of type: BMI Send (Request)
[D 14:35:53.231718] [6] Posting cancellation of type: BMI Recv (Response)
[D 14:35:53.231727] [6] Posting cancellation of type: FLOW
[D 14:35:53.231735] [6] Posting cancellation of type: BMI Recv (Write Ack)
............
<snip>
............
[D 14:35:53.235128] job iteration [69]
[D 14:35:53.235136] [69] Posting cancellation of type: FLOW
[D 14:35:53.235145] [69] Posting cancellation of type: BMI Recv (Write Ack)
[D 14:35:53.235153] job iteration [70]
[D 14:35:53.235161] job iteration [71]
[D 14:35:53.235169] job iteration [72]
[D 14:35:53.235177] job iteration [73]
[E 14:35:53.235212] PVFS2 client: signal 11, faulty address is 0xb8c0c8, from 0xb26900
[E 14:35:53.235810] [bt] /usr/local/pvfs/lib/libpvfs2.so(PINT_client_io_cancel+0xe0) [0xb26900]
[E 14:35:53.235828] [bt] pvfs2-client-core [0x804b926]
[E 14:35:53.235836] [bt] pvfs2-client-core [0x804f141]
[E 14:35:53.235844] [bt] pvfs2-client-core [0x8050c7b]
[E 14:35:53.235852] [bt] pvfs2-client-core(main+0xe48) [0x8052498]


I'll give the pending_ops change a try and post the results.

Michael

Thanks,

-sam

Has anyone else seen the issue with client-core segfaulting on every cancel op? Should the kernel wait_for_cancellation_downcall() be changed to not allow indefinite looping?
Thanks,
Michael

Index: src/client/sysint/client-state-machine.c
===================================================================
RCS file: /projects/cvsroot/pvfs2-1/src/client/sysint/client-state-machine.c,v
retrieving revision 1.103
diff -a -u -p -r1.103 client-state-machine.c
--- src/client/sysint/client-state-machine.c	7 Oct 2009 13:48:49 -0000	1.103
+++ src/client/sysint/client-state-machine.c	15 Jan 2010 20:15:16 -0000
@@ -483,6 +483,7 @@ PVFS_error PINT_client_io_cancel(PVFS_sy
     PVFS_error ret = -PVFS_EINVAL;
     PINT_smcb *smcb = NULL;
     PINT_client_sm *sm_p = NULL;
+    PINT_client_sm *sm_base_p = NULL;
 
     gossip_debug(GOSSIP_CLIENT_DEBUG,
             "PINT_client_io_cancel id %lld\n",lld(id));
@@ -507,6 +508,18 @@ PVFS_error PINT_client_io_cancel(PVFS_sy
     {
 	/* op already completed; nothing to cancel. */
         return 0;
+    }
+    
+    /* We also don't cancel small I/O operations as posted by
+     * sys-small-io.sm.  Check the corresponding flag.  We have 
+     * to jump to the base frame rather than the current frame for this
+     * information because small-io may have pushed a msgpairarray.
+     */ 
+    sm_base_p = PINT_sm_frame(smcb, (-(smcb->frame_count -1)));
+    if(sm_base_p->u.io.small_io)
+    {
+        gossip_debug(GOSSIP_CANCEL_DEBUG,  "skipping cancellation of small I/O operation.\n");
+        return(0);
     }
 
     /* if we fall to here, the I/O operation is still in flight */
_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
