Hi Bart,

Attached is a patch that should fix the segv.  It looks like there might be 
multiple bugs causing this problem.  Working from the top down:

1.  I/O operations at the server are timing out.  This could just be due to 
server load, or to timeout values that are too small for the storage hardware 
involved.
2.  The server is trying to cancel I/O operations that have timed out, and is 
following the cancel path down into the pint-mgmt code (PINT_manager_cancel).  
The operation id given to that function is invalid, so PINT_manager_find_worker 
(called by PINT_manager_cancel) can't find the associated component to continue 
the cancel path, and returns an error code.
3.  PINT_manager_cancel fails to check that error code, so the normal code path 
is followed and the segfault occurs when the worker pointer is dereferenced 
(see the error-check sketch below).  The attached patch should fix this 
segfault, but it won't fix the underlying issue that the operation id appears 
to be invalid, so the I/O operations still won't get cancelled.  The server 
will continue to run, but if you keep getting timeouts you will accumulate 
I/O operations at that server.
4.  Without the patch, the segv causes the PVFS server signal handler to get 
invoked, which tries to clean up and shut down the server.  While trying to 
cancel the bmi unexpected messages, it tries to lock the job completion_mutex, 
which must have been locked already by the trove completion thread or by the 
cancel path described above.  I think that once the segfault is fixed, the 
mutex will eventually get unlocked, and you will be able to signal the server 
to shut down properly.
5.  Right now each kill signal is getting caught by the signal handler, which 
eventually tries to lock that same completion_mutex and blocks there, which is 
why the server can't be killed (see the mutex sketch below).
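
To make #3 concrete, the error-check sketch below shows the general shape of 
the fix.  It is a minimal standalone C example, not the actual pint-mgmt.c 
code or the attached patch; find_worker, manager_cancel, and struct worker 
are made-up stand-ins for PINT_manager_find_worker, PINT_manager_cancel, and 
the real worker structure:

/* Sketch of the kind of check the patch adds; NOT the real pint-mgmt.c
 * code -- names and types are invented for illustration. */
#include <stdio.h>

struct worker { int queue_id; };

/* Stand-in for PINT_manager_find_worker(): fills in *w and returns 0 on
 * success, or a negative error code when the op id is unknown. */
static int find_worker(long op_id, struct worker **w)
{
    (void)op_id;
    *w = NULL;
    return -1;   /* pretend the op id is invalid, as in your log */
}

/* Stand-in for PINT_manager_cancel(): the bug is continuing past a
 * failed lookup and dereferencing 'w' anyway. */
static int manager_cancel(long op_id)
{
    struct worker *w = NULL;
    int ret = find_worker(op_id, &w);
    if (ret < 0)
    {
        /* The missing check: report and return instead of falling
         * through to w->queue_id, which segfaults when w is NULL. */
        fprintf(stderr, "cancel: no worker for op %ld (%d)\n", op_id, ret);
        return ret;
    }
    return w->queue_id;
}

int main(void)
{
    return manager_cancel(123566) < 0 ? 0 : 1;
}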
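
For #4 and #5, the mutex sketch below is a toy program (again, not PVFS2 
code) that reproduces the kind of hang your backtrace shows: a signal handler 
blocking on a non-recursive mutex that the interrupted code path still holds, 
so every subsequent kill just stacks up behind it.

/* Deliberately hangs, the way the server does: the interrupted code
 * path already holds the mutex, the signal handler tries to take it
 * again, and it parks in __lll_lock_wait just like the nested handler
 * frames in the backtrace. */
#include <pthread.h>
#include <signal.h>
#include <stdio.h>

static pthread_mutex_t completion_mutex = PTHREAD_MUTEX_INITIALIZER;

static void sig_handler(int sig)
{
    (void)sig;
    /* Blocks forever: main() still holds completion_mutex.  Taking a
     * mutex inside a signal handler is not async-signal-safe, which is
     * the root of the problem. */
    pthread_mutex_lock(&completion_mutex);
    pthread_mutex_unlock(&completion_mutex);
}

int main(void)
{
    signal(SIGTERM, sig_handler);
    pthread_mutex_lock(&completion_mutex); /* e.g. the cancel path */
    raise(SIGTERM);                        /* "kill" never gets past here */
    puts("never reached");
    return 0;
}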

I'm not sure why you see those device error messages on the client, but I think 
that's a separate problem.

Are you directing the pvfs2-client log output to the system log?  If not, you 
might check the pvfs2-client log file.

There should be multiple threads running on the server.  In gdb you can do 
'thread apply all bt' to get the backtrace of all running threads.

-sam

Attachment: pint-mgmt-segv-fix.patch


On Oct 20, 2010, at 9:22 AM, Bart Taylor wrote:

> 
> Hey guys,
> 
> We have run into an issue a couple of times now where one of our 2.8 server 
> daemons will hang and cannot be killed. This happened again last night and I 
> captured some info from it this time. This particular case has 4 server nodes 
> running 2.6.18-194 kernels. About the time a lot of access started, there 
> were some client timeout messages and several of these in dmesg:
> 
> *****************************************************
> PVFS2 Device Error:  You cannot open the device file 
> /dev/pvfs2-req more than once.  Please make sure that
> there are no instances of a program using this device
> currently running. (You must verify this!)           
> For example, you can use the lsof program as follows:
> 'lsof | grep pvfs2-req' (run this as root)           
>   open_access_count = 1                              
> *****************************************************
> 
> A pvfs2-ping hung on the second server, and I discovered this in syslog on 
> that server. The messages begin when the server was started yesterday and end 
> when the errors stopped.
> 
> Oct 19 14:39:52 server2 PVFS2: [E]  PVFS2 server got signal 15 
> (server_status_flag: 507903)
> Oct 19 14:39:53 server2 kernel: kjournald starting.  Commit interval 5 seconds
> Oct 19 14:39:53 server2 kernel: EXT3 FS on sdb, internal journal
> Oct 19 14:39:53 server2 kernel: EXT3-fs: mounted filesystem with ordered data 
> mode.
> Oct 19 14:39:58 server2 PVFS2: [D] PVFS2 Server version 2.8.2 starting.
> Oct 19 17:32:57 server2 PVFS2: [E] job_time_mgr_expire: job time out: 
> cancelling flow operation, job_id: 121080.
> Oct 19 17:32:57 server2 PVFS2: [E] fp_multiqueue_cancel: flow proto cancel 
> called on 0x13f52d20
> Oct 19 17:32:57 server2 PVFS2: [E] fp_multiqueue_cancel: I/O error occurred
> Oct 19 17:32:57 server2 PVFS2: [E] handle_io_error: flow proto error cleanup 
> started on 0x13f52d20: Operation cancelled (possibly due to timeout)
> Oct 19 17:32:59 server2 PVFS2: [E] WARNING: Trove thread mgr cancel failed, 
> proceeding anyway.
> Oct 19 17:32:59 server2 last message repeated 3 times
> Oct 19 17:32:59 server2 PVFS2: [E] handle_io_error: flow proto 0x13f52d20 
> canceled 4 operations, will clean up.
> Oct 19 17:32:59 server2 PVFS2: [E] job_time_mgr_expire: job time out: 
> cancelling flow operation, job_id: 121085.
> Oct 19 17:32:59 server2 PVFS2: [E] fp_multiqueue_cancel: flow proto cancel 
> called on 0x13f3c750
> Oct 19 17:32:59 server2 PVFS2: [E] fp_multiqueue_cancel: I/O error occurred
> Oct 19 17:32:59 server2 PVFS2: [E] handle_io_error: flow proto error cleanup 
> started on 0x13f3c750: Operation cancelled (possibly due to timeout)
> Oct 19 17:32:59 server2 PVFS2: [E] WARNING: Trove thread mgr cancel failed, 
> proceeding anyway.
> Oct 19 17:32:59 server2 PVFS2: [E] trove_read_callback_fn: I/O error occurred
> Oct 19 17:33:00 server2 PVFS2: [E] WARNING: Trove thread mgr cancel failed, 
> proceeding anyway.
> Oct 19 17:33:02 server2 PVFS2: [E] trove_read_callback_fn: I/O error occurred
> Oct 19 17:33:03 server2 PVFS2: [E] WARNING: Trove thread mgr cancel failed, 
> proceeding anyway.
> Oct 19 17:33:03 server2 PVFS2: [E] WARNING: Trove thread mgr cancel failed, 
> proceeding anyway.
> Oct 19 17:33:04 server2 PVFS2: [E] handle_io_error: flow proto 0x13f3c750 
> canceled 4 operations, will clean up.
> Oct 19 17:33:06 server2 PVFS2: [E] trove_read_callback_fn: I/O error occurred
> Oct 19 17:33:07 server2 PVFS2: [E] job_time_mgr_expire: job time out: 
> cancelling flow operation, job_id: 121979.
> Oct 19 17:33:07 server2 PVFS2: [E] trove_read_callback_fn: I/O error occurred
> Oct 19 17:33:10 server2 PVFS2: [E] fp_multiqueue_cancel: flow proto cancel 
> called on 0x2aaaac1d80e0
> Oct 19 17:33:11 server2 PVFS2: [E] handle_io_error: flow proto 0x13f52d20 
> error cleanup finished: Operation cancelled (possibly due to timeout)
> Oct 19 17:33:11 server2 PVFS2: [E] fp_multiqueue_cancel: I/O error occurred
> Oct 19 17:33:11 server2 PVFS2: [E] handle_io_error: flow proto error cleanup 
> started on 0x2aaaac1d80e0: Operation cancelled (possibly due to timeout)
> Oct 19 17:33:11 server2 PVFS2: [E] handle_io_error: flow proto 0x2aaaac1d80e0 
> canceled 1 operations, will clean up.
> Oct 19 17:33:12 server2 PVFS2: [E] job_time_mgr_expire: job time out: 
> cancelling flow operation, job_id: 122772.
> Oct 19 17:33:12 server2 PVFS2: [E] trove_read_callback_fn: I/O error occurred
> Oct 19 17:33:12 server2 PVFS2: [E] fp_multiqueue_cancel: flow proto cancel 
> called on 0x2aaaac4b2800
> Oct 19 17:33:12 server2 PVFS2: [E] handle_io_error: flow proto 0x2aaaac1d80e0 
> error cleanup finished: Operation cancelled (possibly due to timeout)
> Oct 19 17:33:14 server2 PVFS2: [E] fp_multiqueue_cancel: I/O error occurred
> Oct 19 17:33:15 server2 PVFS2: [E] handle_io_error: flow proto error cleanup 
> started on 0x2aaaac4b2800: Operation cancelled (possibly due to timeout)
> Oct 19 17:33:15 server2 PVFS2: [E] WARNING: Trove thread mgr cancel failed, 
> proceeding anyway.
> Oct 19 17:33:17 server2 last message repeated 3 times
> Oct 19 17:33:17 server2 PVFS2: [E] handle_io_error: flow proto 0x2aaaac4b2800 
> canceled 4 operations, will clean up.
> Oct 19 17:33:17 server2 PVFS2: [E] job_time_mgr_expire: job time out: 
> cancelling flow operation, job_id: 122777.
> Oct 19 17:33:17 server2 PVFS2: [E] trove_read_callback_fn: I/O error occurred
> Oct 19 17:33:17 server2 PVFS2: [E] fp_multiqueue_cancel: flow proto cancel 
> called on 0x2aaaac223d90
> Oct 19 17:33:17 server2 PVFS2: [E] fp_multiqueue_cancel: I/O error occurred
> Oct 19 17:33:17 server2 PVFS2: [E] handle_io_error: flow proto error cleanup 
> started on 0x2aaaac223d90: Operation cancelled (possibly due to timeout)
> Oct 19 17:33:17 server2 PVFS2: [E] WARNING: Trove thread mgr cancel failed, 
> proceeding anyway.
> Oct 19 17:33:18 server2 PVFS2: [E] WARNING: Trove thread mgr cancel failed, 
> proceeding anyway.
> Oct 19 17:33:18 server2 PVFS2: [E] handle_io_error: flow proto 0x2aaaac223d90 
> canceled 2 operations, will clean up.
> Oct 19 17:33:18 server2 PVFS2: [E] trove_read_callback_fn: I/O error occurred
> Oct 19 17:33:18 server2 last message repeated 5 times
> Oct 19 17:33:18 server2 PVFS2: [E] handle_io_error: flow proto 0x13f3c750 
> error cleanup finished: Operation cancelled (possibly due to timeout)
> Oct 19 17:33:18 server2 PVFS2: [E] trove_read_callback_fn: I/O error occurred
> Oct 19 17:33:19 server2 PVFS2: [E] handle_io_error: flow proto 0x2aaaac4b2800 
> error cleanup finished: Operation cancelled (possibly due to timeout)
> Oct 19 17:33:19 server2 PVFS2: [E] trove_read_callback_fn: I/O error occurred
> Oct 19 17:33:19 server2 PVFS2: [E] job_time_mgr_expire: job time out: 
> cancelling flow operation, job_id: 123566.
> Oct 19 17:33:20 server2 PVFS2: [E] fp_multiqueue_cancel: flow proto cancel 
> called on 0x2aaaac1bacf0
> Oct 19 17:33:20 server2 PVFS2: [E] fp_multiqueue_cancel: I/O error occurred
> Oct 19 17:33:20 server2 PVFS2: [E] handle_io_error: flow proto error cleanup 
> started on 0x2aaaac1bacf0: Operation cancelled (possibly due to timeout)
> Oct 19 17:33:20 server2 PVFS2: [E] PINT_manager_find_worker: operation posted 
> with a queue id (30) that isn't held by this manager
> 
> 
> I tried killing the server to get it restarted, but it cannot be killed. I 
> attached GDB and found this:
> 
> 
> Program received signal SIGINT, Interrupt.
> 0x0000003d70a0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
> (gdb) where
> #0  0x0000003d70a0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x0000003d70a08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
> #2  0x0000003d70a08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x0000000000452579 in gen_posix_mutex_lock (mut=0x6da360) at 
> ../pvfs2_src/src/common/gen-locks/gen-locks.c:45
> #4  0x0000000000431e67 in job_bmi_unexp_cancel (id=123372) at 
> ../pvfs2_src/src/io/job/job.c:811
> #5  0x0000000000413b06 in server_purge_unexpected_recv_machines () at 
> ../pvfs2_src/src/server/pvfs2-server.c:1879
> #6  0x000000000041340c in server_sig_handler (sig=15) at 
> ../pvfs2_src/src/server/pvfs2-server.c:1652
> #7  <signal handler called>
> #8  0x0000003d70a0d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
> #9  0x0000003d70a08e1a in _L_lock_1034 () from /lib64/libpthread.so.0
> #10 0x0000003d70a08cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
> #11 0x0000000000452579 in gen_posix_mutex_lock (mut=0x6da360) at 
> ../pvfs2_src/src/common/gen-locks/gen-locks.c:45
> #12 0x0000000000431e67 in job_bmi_unexp_cancel (id=123372) at 
> ../pvfs2_src/src/io/job/job.c:811
> #13 0x0000000000413b06 in server_purge_unexpected_recv_machines () at 
> ../pvfs2_src/src/server/pvfs2-server.c:1879
> #14 0x000000000041340c in server_sig_handler (sig=11) at 
> ../pvfs2_src/src/server/pvfs2-server.c:1652
> #15 <signal handler called>
> #16 0x000000000049a367 in PINT_manager_cancel (manager=0x13e56a70, 
> op_id=46912520503856) at ../pvfs2_src/src/common/mgmt/pint-mgmt.c:1162
> #17 0x0000000000487b29 in dbpf_bstream_direct_cancel (coll_id=445355253, 
> cancel_id=46912520503472, context_id=0)
>     at ../pvfs2_src/src/io/trove/trove-dbpf/dbpf-bstream-direct.c:1249
> #18 0x000000000047cbce in dbpf_dspace_cancel (coll_id=445355253, 
> id=46912520503472, context_id=0)
>     at ../pvfs2_src/src/io/trove/trove-dbpf/dbpf-dspace.c:1551
> #19 0x000000000043ed18 in trove_dspace_cancel (coll_id=445355253, 
> id=46912520503472, context_id=0) at ../pvfs2_src/src/io/trove/trove.c:1045
> #20 0x0000000000439a65 in PINT_thread_mgr_trove_cancel (id=46912520503472, 
> fs_id=445355253, user_ptr=0x2aaaac196670)
>     at ../pvfs2_src/src/io/job/thread-mgr.c:726
> #21 0x000000000047052f in cancel_pending_trove (list=0x2aaaac005e50, 
> coll_id=445355253)
>     at 
> ../pvfs2_src/src/io/flow/flowproto-bmi-trove/flowproto-multiqueue.c:2160
> #22 0x00000000004700af in handle_io_error (error_code=-1610613121, 
> q_item=0x0, flow_data=0x2aaaac005c00)
>     at 
> ../pvfs2_src/src/io/flow/flowproto-bmi-trove/flowproto-multiqueue.c:2039
> #23 0x000000000046bbd5 in fp_multiqueue_cancel (flow_d=0x2aaaac1bacf0) at 
> ../pvfs2_src/src/io/flow/flowproto-bmi-trove/flowproto-multiqueue.c:480
> #24 0x000000000043ae1f in PINT_flow_cancel (flow_d=0x2aaaac1bacf0) at 
> ../pvfs2_src/src/io/flow/flow.c:426
> #25 0x0000000000432706 in job_flow_cancel (id=123566, context_id=0) at 
> ../pvfs2_src/src/io/job/job.c:1395
> #26 0x000000000043a41f in job_time_mgr_expire () at 
> ../pvfs2_src/src/io/job/job-time-mgr.c:277
> #27 0x0000000000426af2 in job_timer_do_work (smcb=0x13fb78a0, 
> js_p=0x13e3ecb0) at ../pvfs2_src/src/server/job-timer.sm:67
> #28 0x0000000000464591 in PINT_state_machine_invoke (smcb=0x13fb78a0, 
> r=0x13e3ecb0) at ../pvfs2_src/src/common/misc/state-machine-fns.c:132
> #29 0x000000000046498c in PINT_state_machine_next (smcb=0x13fb78a0, 
> r=0x13e3ecb0) at ../pvfs2_src/src/common/misc/state-machine-fns.c:309
> #30 0x00000000004649cb in PINT_state_machine_continue (smcb=0x13fb78a0, 
> r=0x13e3ecb0) at ../pvfs2_src/src/common/misc/state-machine-fns.c:327
> #31 0x0000000000410ef2 in main (argc=6, argv=0x7ffff6a54f28) at 
> ../pvfs2_src/src/server/pvfs2-server.c:413
> 
> 
> From the messages you probably deduced that we are using directio. This has 
> happened twice now that I know of. It looks like the cancellation path might 
> be responsible, but I would appreciate someone taking a look. Can I provide 
> any more information while the server is still in this state? I will need to 
> fix it soon.
> 
> Thanks,
> Bart.
> 

_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
