Follow-up:  It looks like more things are broken.  My logwatch report has this line in it:

WARNING:  Kernel Errors Present
    pvfs2_file_read: error in vectored read ...:  2151 Time(s)

And I got an e-mail from my main user:

---------------------
I issued the command from lar-transfer

cp /mnt/pvfs2/schung/amt_toolkit/trunk/MILAGRO/Obs/aircraft/c130/mrg60_c130_20060318_r4/* .

and got error messages such as

cp: reading `/mnt/pvfs2/schung/amt_toolkit/trunk/MILAGRO/Obs/aircraft/c130/mrg60_c130_20060318_r4/wind_speed_obs.txt': No such file or directory


But as far as I can tell, the file exists and was copied over correctly.  I get
the error message for all the files that were copied, but all the files were
copied correctly (as far as I can tell).
-----------------------

I've had other people complain about issues with relative paths
failing, but absolute paths succeeding.
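
If anyone wants to check the relative-versus-absolute behavior on their
own mount, a small Python sketch along these lines might help.  It is only
an illustration and not part of the PVFS2 tools: the names read_check and
compare_paths are made up here, and it assumes the failure would show up
as an OSError from open() or read().

```python
import os
import sys


def read_check(path, nbytes=4096):
    """Try to open path and read up to nbytes; return (ok, detail)."""
    try:
        with open(path, "rb") as fh:
            data = fh.read(nbytes)
        return True, f"read {len(data)} bytes"
    except OSError as exc:
        return False, f"{type(exc).__name__}: {exc.strerror} (errno {exc.errno})"


def compare_paths(directory, filename):
    """Read the same file once via an absolute path and once via a relative one."""
    absolute = os.path.join(os.path.abspath(directory), filename)
    results = {"absolute": read_check(absolute)}

    saved_cwd = os.getcwd()
    try:
        # Change into the directory so the file can be named relatively.
        os.chdir(directory)
        results["relative"] = read_check(os.path.join(".", filename))
    except OSError as exc:
        # The chdir itself failed; record that as the relative-path result.
        results["relative"] = (False, f"chdir failed: {exc.strerror}")
    finally:
        os.chdir(saved_cwd)
    return results


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical script name): python pathcheck.py DIRECTORY FILENAME
    for kind, (ok, detail) in compare_paths(sys.argv[1], sys.argv[2]).items():
        print(f"{kind:8s} {'OK' if ok else 'FAIL'}  {detail}")
```

Running it against a directory on the PVFS2 mount and one of the files in
it prints one line per access style, which makes it easier to see whether
only the relative form fails.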

I'm getting a really bad feeling about this.....

--Jim

On Fri, Nov 6, 2009 at 3:15 PM, Jim Kusznir <[email protected]> wrote:
> Hi all:
>
> Well, it happened again today...Another pvfs2 client crash on my head
> node.  This one was worse than many I've experienced.  When my users
> informed me, I found the system load at 14 with the cpu utilization %
> fairly low.  pvfs2-client-core was state "r" at 100% utilization, but
> was actually not getting anything done (the I/O that was underway has
> very definitely stopped according to the user).  I also noticed that
> the process had only been alive for about 2 hours at that time (last
> time pvfs2-client had restarted pvfs2-client-core), and the problems
> started getting noticeably worse at that point in time.
>
> I tried a kill -9 on the process, again hoping to get a responsive
> pvfs2-client-core process, but nothing happened...it would not die.
> It would not respond to anything.  Even when I tried to reboot the
> server, it wouldn't give way (or let the server reboot..I finally had
> to go down to the machine room and hard power cycle the system).
>
> The pvfs2-client.log file had this to say:
>
> [D 23:37:29.802697] [INFO]: Mapping pointer 0x2b44497ef000 for I/O.
> [D 23:37:29.818953] [INFO]: Mapping pointer 0xaaf7000 for I/O.
> [E 23:47:37.825740] PVFS2 client: signal 11, faulty address is 0x7b9,
> from 0x425120
> [E 23:47:37.826245] [bt]
> pvfs2-client-core(PINT_client_io_cancel+0x1f0) [0x425120]
> [E 23:47:37.826260] [bt]
> pvfs2-client-core(PINT_client_io_cancel+0x1f0) [0x425120]
> [E 23:47:37.826271] [bt] pvfs2-client-core [0x40ee95]
> [E 23:47:37.826282] [bt] pvfs2-client-core [0x41248e]
> [E 23:47:37.826292] [bt] pvfs2-client-core [0x4133df]
> [E 23:47:37.826302] [bt] pvfs2-client-core(main+0xc60) [0x414780]
> [E 23:47:37.826312] [bt] /lib64/libc.so.6(__libc_start_main+0xf4) 
> [0x3694c1d8b4]
> [E 23:47:37.826322] [bt] pvfs2-client-core [0x40d989]
> [E 23:47:37.831308] Child process with pid 31589 was killed by an
> uncaught signal 6
> [E 23:47:37.835138] PVFS Client Daemon Started.  Version 2.8.1
> [D 23:47:37.835358] [INFO]: Mapping pointer 0x2b46cd653000 for I/O.
> [D 23:47:37.851828] [INFO]: Mapping pointer 0x1a4e1000 for I/O.
> [E 23:51:34.274309] Child process with pid 32537 was killed by an
> uncaught signal 6
> [E 23:51:34.278184] PVFS Client Daemon Started.  Version 2.8.1
> [D 23:51:34.278404] [INFO]: Mapping pointer 0x2ba89091f000 for I/O.
> [D 23:51:34.294720] [INFO]: Mapping pointer 0x131a6000 for I/O.
> [E 23:58:20.185034] PVFS2 client: signal 11, faulty address is 0x78c,
> from 0x425120
> [E 23:58:20.185553] [bt]
> pvfs2-client-core(PINT_client_io_cancel+0x1f0) [0x425120]
> [E 23:58:20.185568] [bt]
> pvfs2-client-core(PINT_client_io_cancel+0x1f0) [0x425120]
> [E 23:58:20.185579] [bt] pvfs2-client-core [0x40ee95]
> [E 23:58:20.185590] [bt] pvfs2-client-core [0x41248e]
> [E 23:58:20.185600] [bt] pvfs2-client-core [0x4133df]
> [E 23:58:20.185610] [bt] pvfs2-client-core(main+0xc60) [0x414780]
> [E 23:58:20.185620] [bt] /lib64/libc.so.6(__libc_start_main+0xf4) 
> [0x3694c1d8b4]
> [E 23:58:20.185630] [bt] pvfs2-client-core [0x40d989]
> [E 23:58:20.190414] Child process with pid 32677 was killed by an
> uncaught signal 6
> [E 23:58:20.194285] PVFS Client Daemon Started.  Version 2.8.1
> [D 23:58:20.194506] [INFO]: Mapping pointer 0x2acb010a2000 for I/O.
> [D 23:58:20.211131] [INFO]: Mapping pointer 0x5417000 for I/O.
> [E 09:50:42.173029] fp_multiqueue_cancel: flow proto cancel called on 
> 0x587c078
> [E 09:50:42.173091] fp_multiqueue_cancel: I/O error occurred
> [E 09:50:42.173107] handle_io_error: flow proto error cleanup started
> on 0x587c078: Operation cancelled (possibly due to timeout)
> [E 09:50:42.173161] handle_io_error: flow proto 0x587c078 canceled 1
> operations, will clean up.
> [E 09:50:42.173209] bmi_to_mem_callback_fn: I/O error occurred
> [E 09:50:42.173223] handle_io_error: flow proto 0x587c078 error
> cleanup finished: Operation cancelled (possibly due to timeout)
> [E 09:52:42.125385] PVFS2 client: signal 11, faulty address is 0x41d5,
> from 0x413301
> [E 09:52:42.125962] [bt] pvfs2-client-core [0x413301]
> [E 09:52:42.125977] [bt] pvfs2-client-core [0x413301]
> [E 09:52:42.125988] [bt] pvfs2-client-core(main+0xc60) [0x414780]
> [E 09:52:42.125999] [bt] /lib64/libc.so.6(__libc_start_main+0xf4) 
> [0x3694c1d8b4]
> [E 09:52:42.126009] [bt] pvfs2-client-core [0x40d989]
> [E 09:52:42.131014] Child process with pid 377 was killed by an
> uncaught signal 6
> [E 09:52:42.134941] PVFS Client Daemon Started.  Version 2.8.1
> [D 09:52:42.135166] [INFO]: Mapping pointer 0x2b416a037000 for I/O.
> [D 09:52:42.151602] [INFO]: Mapping pointer 0x9f7e000 for I/O.
> [E 10:30:49.813443] Got an unrecognized/unimplemented vfs operation of
> type ff000000.
> [E 10:30:49.813524] Post of op: PVFS_VFS_OP_INVALID failed!
>
> I was not running this under anything that could give me a trace or
> other debugging information, and with users lining up, I couldn't take
> any more time debugging, so I restarted to get it back online.  It's
> currently running without valgrind or other debugging tools.  I have
> asked the user who was most active when this happened to pay extra
> attention and let me know if she can reproduce the problem.
>
> --Jim
>
