Re: [Lustre-discuss] I/O errors with NAMD

2010-09-07 Thread Mike Hanby
On 07/23/2010 06:39 PM, Rick Grubin wrote: > >> On 2010-07-23, at 11:53, Richard Lefebvre wrote: >> >>> If I had some Lustre error, it would give me a clue, but the o

Re: [Lustre-discuss] I/O errors with NAMD

2010-07-23 Thread John Hammond
On 07/23/2010 06:39 PM, Rick Grubin wrote: > >> On 2010-07-23, at 11:53, Richard Lefebvre wrote: >> >>> If I had some Lustre error, it would give me a clue, but the only >>> errors the users get is the following traceback on the >>> application: >>> >>> -

Re: [Lustre-discuss] I/O errors with NAMD

2010-07-23 Thread Rick Grubin
> On 2010-07-23, at 11:53, Richard Lefebvre wrote: > >> If I had some Lustre error, it would give me a clue, but the only errors the >> users get is the following traceback on the application: >> >> --- >> Reason: FATAL ERROR: Er

Re: [Lustre-discuss] I/O errors with NAMD

2010-07-23 Thread Andreas Dilger
On 2010-07-23, at 11:53, Richard Lefebvre wrote: > If I had some Lustre error, it would give me a clue, but the only errors the > users get is the following traceback on the application: > > --- > Reason: FATAL ERROR: Error on write

Re: [Lustre-discuss] I/O errors with NAMD

2010-07-23 Thread Ned Bass
On Fri, Jul 23, 2010 at 10:53:45AM -0700, Richard Lefebvre wrote: > If I had some Lustre error, it would give me a clue, but the only errors > the users get is the following traceback on the application: > > --- > Reason: FATAL ERROR

Re: [Lustre-discuss] I/O errors with NAMD

2010-07-23 Thread Wojciech Turek
Hi Larry, From my experience, if the application is doing some I/O and the server evicts the node that the application is running on, this will definitely result in an EIO error being sent to the application, hence the input/output error message in the standard output of the application. In the case of my clu
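A minimal sketch of how an EIO like the one described above surfaces to a program. It is written in Python purely for illustration: it is not NAMD's code, and write_all is a hypothetical helper name, not part of NAMD or Lustre. It loops over short writes and reports how far the write got when the kernel returns EIO.

```python
import errno
import os


def write_all(fd, data):
    """Write every byte of data to fd, looping over short writes.

    Hypothetical helper for illustration only; not taken from NAMD or Lustre.
    """
    view = memoryview(data)
    written = 0
    while written < len(view):
        try:
            # os.write may write fewer bytes than requested; keep going.
            written += os.write(fd, view[written:])
        except OSError as exc:
            if exc.errno == errno.EIO:
                # EIO is the error the post above associates with an eviction.
                raise RuntimeError(
                    f"I/O error after {written} of {len(view)} bytes"
                ) from exc
            raise
    return written
```

Reporting the byte offset at failure can help match the application-side error against the time of an eviction message in the client or server logs.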

Re: [Lustre-discuss] I/O errors with NAMD

2010-07-23 Thread Richard Lefebvre
If I had some Lustre error, it would give me a clue, but the only error the users get is the following traceback from the application: --- Reason: FATAL ERROR: Error on write to binary file restart/ABCD_les4.95.vel: Interrupted sy

Re: [Lustre-discuss] I/O errors with NAMD

2010-07-23 Thread Larry
There are many reasons a server might evict a client, such as a network error or a ptlrpcd bug. In my experience, though, the only time I see the I/O error is when running NAMD on a Lustre filesystem. I sometimes see other "evict" events, but none of them results in an I/O error. So

Re: [Lustre-discuss] I/O errors with NAMD

2010-07-23 Thread Wojciech Turek
There is a similar thread on this mailing list: http://groups.google.com/group/lustre-discuss-list/browse_thread/thread/afe24159554cd3ff/8b37bababf848123?lnk=gst&q=I%2FO+error+on+clients# There is also an open bug that reports a similar problem: https://bugzilla.lustre.org/show_bug.cgi?id=23190 On

Re: [Lustre-discuss] I/O errors with NAMD

2010-07-23 Thread Larry
We sometimes have the same problem when running NAMD on Lustre. The console log suggests a file lock expired, but I don't know why. On Fri, Jul 23, 2010 at 8:12 AM, Wojciech Turek wrote: > Hi Richard, > > If the cause of the I/O errors is Lustre there will be some message in the > logs. I am seeing

Re: [Lustre-discuss] I/O errors with NAMD

2010-07-22 Thread Wojciech Turek
Hi Richard, If the cause of the I/O errors is Lustre there will be some message in the logs. I am seeing a similar problem with some applications that run on our cluster. The symptoms are always the same: just before the application crashes with an I/O error, the node gets evicted with a message like this: Lus

Re: [Lustre-discuss] I/O errors with NAMD

2010-07-22 Thread Andreas Dilger
On 2010-07-22, at 14:59, Richard Lefebvre wrote: > I have a problem with the Scalable molecular dynamics software NAMD. It > write restart files once in a while. But sometime the binary write > crashes. The when it crashes is not constant. The only constant thing is > it happens when it writes o

[Lustre-discuss] I/O errors with NAMD

2010-07-22 Thread Richard Lefebvre
Hi, I have a problem with the scalable molecular dynamics software NAMD. It writes restart files once in a while, but sometimes the binary write crashes. The timing of the crash is not constant. The only constant is that it happens when it writes to our Lustre file system. When it writes to somethin