On 02/21/2007 06:42 PM, Sam Lang wrote:
> 
> On Feb 21, 2007, at 11:28 AM, Trach-Minh Tran wrote:
> 
>> On 02/21/2007 06:10 PM, Sam Lang wrote:
>>>
>>> On Feb 21, 2007, at 10:49 AM, Trach-Minh Tran wrote:
>>>
>>>> On 02/21/2007 05:18 PM, Sam Lang wrote:
>>>>>
>>>>> Hi Minh,
>>>>>
>>>>> I got the order of my AC_TRY_COMPILE arguments wrong.  That was pretty
>>>>> sloppy on my part.  I've attached a patch that should fix the error
>>>>> you're getting.  I'm not sure it will apply cleanly to the already
>>>>> patched 2.6.2 source that you have.  Better to start with a clean
>>>>> 2.6.2 tarball.
>>>>
>>>> Hi Sam,
>>>>
>>>> Thanks for your prompt response. I can now load the module. I will do
>>>> some more tests with this 2.6.2 version. So far, using my MPI-IO
>>>> program, I have found that it is not as stable as the 2.6.1 version:
>>>> during about half an hour of running the test, 2 data servers (out of
>>>> 8) have already died!
>>>
>>> That's surprising; the 2.6.2 release didn't include any changes to the
>>> servers from 2.6.1.  Did you get any messages in the server logs on the
>>> nodes that died?
>>>
>>>>
>>>> Do you think that I should stay with 2.6.1 + the misc-bug.patch from
>>>> Murali?
>>>
>>> There aren't any other significant fixes in 2.6.2 besides support for
>>> the latest Berkeley DB release, and the misc-bug patch that you mention,
>>> so using 2.6.1 shouldn't be a problem for you.  That being said, if the
>>> servers crash for you on 2.6.2, it's likely that they will do so with
>>> 2.6.1 and you just haven't hit it yet.  I'd also like to figure out
>>> exactly what is causing the servers to crash.  Can you send your MPI-IO
>>> program to us?
>>>
>>
>> Hi Sam,
>>
>> There is nothing in the server logs! Maybe tomorrow (it is now 6:30 pm
>> here) I will have more info from the MPI-IO runs I've just submitted.
> 
> Rob thinks this might be related to the ROMIO ad_pvfs bug reported a
> couple of days ago, but even so, corruption on the client shouldn't
> cause the servers to segfault (esp. if the corruption is outside the
> PVFS system interfaces).  If possible, it would be great to get a stack
> trace from one of the crashed servers.

Hi Sam,

How can I get a stack trace of the pvfs2 server when it dies?
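One thing I could try, assuming gdb is installed on the I/O nodes and the
server is allowed to dump core (the binary and core file paths below are
just examples), is something like:

  # allow core dumps in the shell that starts the server,
  # then start pvfs2-server as usual
  ulimit -c unlimited

  # after a crash, load the core file into gdb and print the backtrace
  gdb /usr/sbin/pvfs2-server /path/to/core
  (gdb) bt full

or attaching gdb to a running server ("gdb -p <pid>") and waiting for it
to catch the fault. Please tell me if there is a better way.
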
I have run another series of tests with the mpi-io program for another
hour, but none of the servers died! I can add that when one of the servers
previously died, I got the following messages from my MPI program, while
nothing appeared in the pvfs2_server.log file:

=====================================
[E 17:26:51.714686] msgpair failed, will retry: Broken pipe
[E 17:26:51.736877] handle_io_error: flow proto error cleanup started on 
0x6fd870, error_code: -1073741973
[E 17:26:51.737091] handle_io_error: flow proto 0x6fd870 canceled 0 operations, 
will clean up.
[E 17:26:51.737108] handle_io_error: flow proto 0x6fd870 error cleanup 
finished, error_code: -1073741973
[E 17:26:53.734663] msgpair failed, will retry: Connection refused
[E 17:26:55.754647] msgpair failed, will retry: Connection refused
[E 17:26:57.774636] msgpair failed, will retry: Connection refused
[E 17:26:59.794622] msgpair failed, will retry: Connection refused
[E 17:27:01.814610] msgpair failed, will retry: Connection refused
[E 17:27:01.814651] *** msgpairarray_completion_fn: msgpair to server 
tcp://io4:3334 failed: Connection refused
[E 17:27:01.814666] *** Out of retries.
=====================================

-Minh.