While continuing to look at the 100% CPU usage (kernel loop) that Randy 
wrote about previously, I've narrowed the issue down a little. It 
appears to be related to cancellation of operations when a write() call 
is blocking and I/O has been retried. 

On our cluster the retries were caused by congestion, but I am 
re-creating the congestion by killing an I/O server. The test C program 
I'm using just loops around 4k writes to a PVFS file. If I kill a PVFS 
I/O server while the program is executing, the write hangs (as 
expected). About 30% of the time, when I then try to kill the process 
doing the writing, it spikes to 100% CPU usage and is not killable. 
Also, every time I try to kill the writing process, pvfs2-client-core 
segfaults with something similar to:

[E 11:58:09.724121] PVFS2 client: signal 11, faulty address is 0x41ec, from 0x8050b51
[E 11:58:09.725403] [bt] pvfs2-client-core [0x8050b51]
[E 11:58:09.725427] [bt] pvfs2-client-core(main+0xe48) [0x8052498]
[E 11:58:09.725436] [bt] /lib/libc.so.6(__libc_start_main+0xdc) [0x75ee9c]
[E 11:58:09.725444] [bt] pvfs2-client-core [0x804a381]
[E 11:58:09.740133] Child process with pid 2555 was killed by an uncaught signal 6
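
For anyone who wants to reproduce this, the test program described above 
amounts to something like the sketch below. This is not the exact 
program I ran, just a minimal stand-in: the function name write_loop, 
the CHUNK_SIZE macro and the fill byte are made up for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define CHUNK_SIZE 4096  /* 4k per write(), as in the test described above */

/*
 * Hypothetical helper: write `count` chunks of CHUNK_SIZE bytes to `path`.
 * Returns the total number of bytes written, or -1 on error.
 */
long write_loop(const char *path, long count)
{
    char buf[CHUNK_SIZE];
    long total = 0;
    int fd;

    assert(path != NULL && count >= 0);
    memset(buf, 'x', sizeof(buf));  /* arbitrary fill byte */

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    for (long i = 0; i < count; i++) {
        /* On PVFS this is the call that blocks once an I/O server dies. */
        ssize_t n = write(fd, buf, sizeof(buf));
        if (n < 0) {
            perror("write");
            close(fd);
            return -1;
        }
        total += n;
    }

    close(fd);
    return total;
}
```

Call it with a path on the PVFS mount (the mount point depends on your 
setup) and a count large enough that there is time to kill an I/O 
server while it is still writing.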

In the cases when the CPU usage becomes 100% (and the process can't be 
terminated), the for() loop in PINT_client_io_cancel strangely segfaults 
during exactly iteration 31. The value of sm_p->u.io.context_count is 
in the hundreds, so there are a significant number of jobs left to cancel.

The real issue is the 30% of the time when the process gets stuck in the 
kernel waiting for a downcall. With some additional debugging, I can see 
that the process's write() call is stuck in the while(1) loop in 
wait_for_cancellation_downcall(). The function assumes that the request 
will either time out or be serviced after one iteration of the loop. 
In this situation, however, neither happens. The schedule_timeout() call 
returns immediately because a signal is pending, but the op is never 
serviced, so the loop spins indefinitely.

Has anyone else seen the issue with client-core segfaulting on every 
cancel op? Should the kernel wait_for_cancellation_downcall() be changed 
to not allow indefinite looping? 

Thanks,
Michael
