On 02/21/2007 06:42 PM, Sam Lang wrote:
>
> On Feb 21, 2007, at 11:28 AM, Trach-Minh Tran wrote:
>
>> On 02/21/2007 06:10 PM, Sam Lang wrote:
>>>
>>> On Feb 21, 2007, at 10:49 AM, Trach-Minh Tran wrote:
>>>
>>>> On 02/21/2007 05:18 PM, Sam Lang wrote:
>>>>>
>>>>> Hi Minh,
>>>>>
>>>>> I got the order of my AC_TRY_COMPILE arguments wrong. That was pretty
>>>>> sloppy on my part. I've attached a patch that should fix the error
>>>>> you're getting. I'm not sure it will apply cleanly to the already
>>>>> patched 2.6.2 source that you have. Better to start with a clean
>>>>> 2.6.2 tarball.
>>>>
>>>> Hi Sam,
>>>>
>>>> Thanks for your prompt response. I can now load the module. I will do
>>>> some more tests with this 2.6.2 version. So far, using my MPI-IO
>>>> program, I've found that it is not as stable as the 2.6.1 version:
>>>> during about half an hour of running the test, 2 data servers (out
>>>> of 8) have already died!
>>>
>>> That's surprising; the 2.6.2 release didn't include any changes to the
>>> servers from 2.6.1. Did you get any messages in the server logs on the
>>> nodes that died?
>>>
>>>>
>>>> Do you think that I should stay with 2.6.1 + the misc-bug.patch from
>>>> Murali?
>>>
>>> There aren't any other significant fixes in 2.6.2 besides support for
>>> the latest Berkeley DB release and the misc-bug patch that you mention,
>>> so using 2.6.1 shouldn't be a problem for you. That being said, if the
>>> servers crash for you on 2.6.2, it's likely that they will do so with
>>> 2.6.1 and you just haven't hit it yet. I'd also like to figure out
>>> exactly what is causing the servers to crash. Can you send your MPI-IO
>>> program to us?
>>>
>>
>> Hi Sam,
>>
>> There is nothing in the server logs! Maybe tomorrow (it is now 6:30 pm
>> here) I will have more info from the MPI-IO runs I've just submitted.
>
> Rob thinks this might be related to the ROMIO ad_pvfs bug reported a
> couple of days ago, but even so, corruption on the client shouldn't
> cause the servers to segfault (especially if the corruption is outside
> the PVFS system interfaces). If possible, it would be great to get a
> stack trace from one of the crashed servers.
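For reference, here is a rough sketch of the kind of collective MPI-IO write
test being discussed. This is not the actual program from this thread, just a
minimal pattern in which every rank writes its own block of a shared file; the
file name, the "pvfs2:" ROMIO prefix, and the buffer size are assumptions made
for illustration only.

=====================================
/* Minimal sketch only: NOT the test program from this thread.  It just
 * illustrates a collective write pattern.  The file name, the "pvfs2:"
 * ROMIO prefix and the buffer size are assumptions for illustration. */
#include <mpi.h>

#define NDBL (64 * 1024)                 /* 512 KB of doubles per rank */

int main(int argc, char **argv)
{
    static double buf[NDBL];
    MPI_File fh;
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < NDBL; i++)
        buf[i] = (double)rank;

    /* every rank writes its own contiguous block of a shared file */
    MPI_File_open(MPI_COMM_WORLD, "pvfs2:/mnt/pvfs2/testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at_all(fh, (MPI_Offset)rank * sizeof(buf), buf,
                          NDBL, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}
=====================================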
Hi Sam,

How can I get a stack trace of the pvfs2 server when it dies? I have run
another series of tests with the MPI-IO program for another hour, but none
of the servers died! I can add that when one of the servers previously died,
I got the following messages from my MPI program, while nothing appeared in
the pvfs2_server.log file:

=====================================
[E 17:26:51.714686] msgpair failed, will retry: Broken pipe
[E 17:26:51.736877] handle_io_error: flow proto error cleanup started on 0x6fd870, error_code: -1073741973
[E 17:26:51.737091] handle_io_error: flow proto 0x6fd870 canceled 0 operations, will clean up.
[E 17:26:51.737108] handle_io_error: flow proto 0x6fd870 error cleanup finished, error_code: -1073741973
[E 17:26:53.734663] msgpair failed, will retry: Connection refused
[E 17:26:55.754647] msgpair failed, will retry: Connection refused
[E 17:26:57.774636] msgpair failed, will retry: Connection refused
[E 17:26:59.794622] msgpair failed, will retry: Connection refused
[E 17:27:01.814610] msgpair failed, will retry: Connection refused
[E 17:27:01.814651] *** msgpairarray_completion_fn: msgpair to server tcp://io4:3334 failed: Connection refused
[E 17:27:01.814666] *** Out of retries.
=====================================

-Minh.
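One common way to get a stack trace out of a server that dies without logging
anything is to let it dump core (run "ulimit -c unlimited" in the shell that
starts pvfs2-server, then open the resulting core file with gdb and type "bt").
Another is to install a signal handler that prints its own backtrace when the
process receives SIGSEGV. The standalone sketch below is not from the PVFS2
sources; it only assumes glibc's <execinfo.h> and shows the general technique.

=====================================
/* Minimal standalone sketch, not from the PVFS2 sources: a SIGSEGV
 * handler that dumps its own backtrace with glibc's <execinfo.h>
 * before re-raising the signal so a core file is still produced.
 * Build with -g -rdynamic so the frames resolve to function names. */
#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

static void segv_handler(int sig)
{
    void *frames[64];
    int n = backtrace(frames, 64);                  /* raw return addresses  */
    backtrace_symbols_fd(frames, n, STDERR_FILENO); /* signal-safe dump      */
    signal(sig, SIG_DFL);                           /* restore default...    */
    raise(sig);                                     /* ...and crash for real */
}

int main(void)
{
    signal(SIGSEGV, segv_handler);
    /* a server's main loop would go here; the next line just triggers
     * a deliberate crash to demonstrate the handler */
    *(volatile int *)0 = 0;
    return 0;
}
=====================================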
