In continuing to look at the 100% CPU usage (kernel loop) problem that Randy wrote about previously, I've narrowed the issue down a little. It appears to be related to the cancellation of operations when a write() call is blocking and I/O has been retried.

On our cluster the retries were caused by congestion; here I am re-creating the condition by killing an I/O server. The test C program I'm using just loops over 4k writes to a PVFS file. If I kill a PVFS I/O server while the program is executing, the write hangs (as expected). About 30% of the time, when I then try to kill the process doing the writing, it spikes to 100% CPU usage and is not killable.

Also, every time I try to kill the writing process, pvfs2-client-core segfaults with something similar to:

    [E 11:58:09.724121] PVFS2 client: signal 11, faulty address is 0x41ec, from 0x8050b51
    [E 11:58:09.725403] [bt] pvfs2-client-core [0x8050b51]
    [E 11:58:09.725427] [bt] pvfs2-client-core(main+0xe48) [0x8052498]
    [E 11:58:09.725436] [bt] /lib/libc.so.6(__libc_start_main+0xdc) [0x75ee9c]
    [E 11:58:09.725444] [bt] pvfs2-client-core [0x804a381]
    [E 11:58:09.740133] Child process with pid 2555 was killed by an uncaught signal 6

In the cases when the CPU usage becomes 100% (and the process can't be terminated), the for() loop in PINT_client_io_cancel strangely segfaults during exactly iteration 31. The value of sm_p->u.io.context_count is in the hundreds, so there are a significant number of jobs left to cancel.

The real issue is the 30% of the time when the process gets stuck in the kernel waiting for a downcall. Additional debugging shows that the process's write() call is stuck in the while(1) loop in wait_for_cancellation_downcall(). The function assumes that either the request will time out or it will be serviced after one iteration of the loop. In this situation, however, neither occurs: the schedule_timeout() call returns immediately with a signal pending, but the op is never serviced, so the loop spins indefinitely.

Has anyone else seen the issue with client-core segfaulting on every cancel op? Should the kernel wait_for_cancellation_downcall() be changed to not allow indefinite looping?

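A test program of the kind described above could look roughly like the sketch below. It is not the original program: the function name write_blocks() and the mount path in the usage note are assumptions made for illustration. It only shows the shape of the loop, repeated 4k write() calls to a single file, stopping when write() fails or returns short.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096 /* 4k per write, as in the description above */

/*
 * Hypothetical helper: write `count` blocks of BLOCK_SIZE bytes to `path`.
 * A negative `count` means "loop until write() fails".
 * Returns the number of full blocks written, or -1 if open() fails.
 */
long write_blocks(const char *path, long count)
{
    char buf[BLOCK_SIZE];
    long written = 0;
    int fd;

    memset(buf, 'x', sizeof(buf));

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    while (count < 0 || written < count) {
        /* This is the call that blocks when an I/O server goes away. */
        ssize_t n = write(fd, buf, sizeof(buf));
        if (n < 0) {
            perror("write");
            break;
        }
        if (n != (ssize_t)sizeof(buf)) {
            fprintf(stderr, "short write: %zd bytes\n", n);
            break;
        }
        written++;
    }

    close(fd);
    return written;
}
```

Calling write_blocks("/mnt/pvfs2/testfile", -1) from main() would loop indefinitely against a file on a PVFS mount (the path is an assumption); killing an I/O server while it runs should then leave the process blocked in write().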
Thanks,
Michael