Re: [Lustre-discuss] WARNING: data corruption issue found in 1.8.x releases

2009-09-15 Thread Johann Lombardi
On Sep 15, 2009, at 8:44 AM, Robin Humble wrote: > as we are about to throw users onto the new system, can I ask for a > quick update pointing us to the current best guess at a workaround/fix > for the 1.8.1 read cache problems please? > > to me it looks like > https://bugzilla.lustre.org/show_bug

Re: [Lustre-discuss] WARNING: data corruption issue found in 1.8.x releases

2009-09-14 Thread Robin Humble
On Thu, Sep 10, 2009 at 12:35:54PM +0200, Johann Lombardi wrote: >We have attached a new patch to bug 20560 which should address your >problem which may happen in rare cases with partial truncates. as we are about to throw users onto the new system, can I ask for a quick update pointing us to the

Re: [Lustre-discuss] WARNING: data corruption issue found in 1.8.x releases

2009-09-11 Thread Oleg Drokin
Hello! On Sep 11, 2009, at 9:33 AM, Aaron Knister wrote: > Is the read cache corruption actually causing on-disk corruption? Or > just in-memory corruption? I'm assuming the write cache corruption > would end up causing the file to become corrupt on disk, but if a > node crashes during a wr

Re: [Lustre-discuss] WARNING: data corruption issue found in 1.8.x releases

2009-09-11 Thread Aaron Knister
Is the read cache corruption actually causing on-disk corruption? Or just in-memory corruption? I'm assuming the write cache corruption would end up causing the file to become corrupt on disk, but if a node crashes during a write then I'm personally not all that bothered by it. On a side note, any

Re: [Lustre-discuss] WARNING: data corruption issue found in 1.8.x releases

2009-09-10 Thread Charles A. Taylor
On Thu, 2009-09-10 at 09:28 +0200, Johann Lombardi wrote: > > > > Can you please post your stack traces into bug 20560 so that we can > > resolve this problem ASAP. > > For the record, we tested this workaround many times on various > clusters and it worked just fine. I see that you have provided

Re: [Lustre-discuss] WARNING: data corruption issue found in 1.8.x releases

2009-09-10 Thread Johann Lombardi
On Sep 10, 2009, at 9:28 AM, Johann Lombardi wrote: > clusters and it worked just fine. I see that you have provided more > data > in bug 20560, we are looking at it. We have attached a new patch to bug 20560 which should address your problem which may happen in rare cases with partial truncates

Re: [Lustre-discuss] WARNING: data corruption issue found in 1.8.x releases

2009-09-10 Thread Johann Lombardi
On Sep 10, 2009, at 8:05 AM, Andreas Dilger wrote: >> At the moment we are not even sure we can run with just >> read_cache_enable=0. We just know that we can't run with them both >> disabled for more than a few minutes without crashing in >> obd_filter_preprw(). > > Can you please post your stack t

Re: [Lustre-discuss] WARNING: data corruption issue found in 1.8.x releases

2009-09-09 Thread Andreas Dilger
On Sep 09, 2009 15:30 -0400, Charles A. Taylor wrote: > > You recommend disabling the read and the write as the settings > > indicate or just the read as the text indicates? > > A clarification would be good here. So far, we have found that our > OSSs crash with the recommended work-around so t

Re: [Lustre-discuss] WARNING: data corruption issue found in 1.8.x releases

2009-09-09 Thread Oleg Drokin
Hello! On Sep 9, 2009, at 2:07 PM, Charles A. Taylor wrote: > Anyway, your email concerned us so we issued the recommended commands > on our OSSs to disable the caching. That promptly crashed two of our > OSSs. We got the servers back up and after fsck'ing (fsck.ext4) all > the OSTs and remo

Re: [Lustre-discuss] WARNING: data corruption issue found in 1.8.x releases

2009-09-09 Thread Charles A. Taylor
FWIW, we seem to be OK with just "read_cache_enable=0". Don't know if that is sufficient to avoid the data corruption or not. It will have to do though because running with "writethrough_cache_enable=0" crashes the OSSs within a few minutes of completing recovery. Charlie Taylor UF HPC Center
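
For readers following the workaround discussion, here is a minimal sketch of how the two settings named in this thread could be applied across several OSS nodes. It is an illustration under stated assumptions, not the procedure from the original announcement: the `lctl set_param` form and the `obdfilter.*.` prefix are taken from the Lustre 1.8 manual rather than from these messages, the host names are placeholders, and applying the command on each OSS over ssh is an assumption (the answer to the "EACH OSS?" question is not visible in the excerpts here). Because one site reported OSS crashes with writethrough_cache_enable=0, the sketch only touches the read cache unless asked otherwise, and it prints the commands instead of running them.

```python
import shlex

# Tunable names as quoted in the thread; the "obdfilter.*." prefix is an
# assumption based on the Lustre 1.8 manual, not on these messages.
READ_CACHE = "obdfilter.*.read_cache_enable"
WRITETHROUGH_CACHE = "obdfilter.*.writethrough_cache_enable"


def cache_commands(disable_writethrough=False):
    """Return the lctl argument lists to run locally on one OSS."""
    cmds = [["lctl", "set_param", f"{READ_CACHE}=0"]]
    if disable_writethrough:
        # Reported in this thread to crash the OSSs at one site; opt-in only.
        cmds.append(["lctl", "set_param", f"{WRITETHROUGH_CACHE}=0"])
    return cmds


def ssh_commands(oss_hosts, disable_writethrough=False):
    """Wrap each lctl command in ssh, one invocation per OSS host."""
    out = []
    for host in oss_hosts:
        for cmd in cache_commands(disable_writethrough):
            # shlex.join quotes the '*' so the remote shell does not expand it.
            out.append(["ssh", host, shlex.join(cmd)])
    return out


if __name__ == "__main__":
    # Dry run: print what would be executed; "oss1"/"oss2" are placeholders.
    for argv in ssh_commands(["oss1", "oss2"]):
        print(shlex.join(argv))
```

Keeping the command construction separate from execution makes it easy to review the exact strings before anything touches a production server. Whether read_cache_enable=0 alone is enough to avoid the corruption is left open in the message above.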

Re: [Lustre-discuss] WARNING: data corruption issue found in 1.8.x releases

2009-09-09 Thread Charles A. Taylor
On Wed, 2009-09-09 at 13:23 -0600, Lundgren, Andrew wrote: > Does this need to be run on EACH OSS? Is there a central way to do it on the > MDS? > > You recommend disabling the read and the write as the settings indicate or > just the read as the text indicates? A clarification would be good

Re: [Lustre-discuss] WARNING: data corruption issue found in 1.8.x releases

2009-09-09 Thread Lundgren, Andrew
Does this need to be run on EACH OSS? Is there a central way to do it on the MDS? You recommend disabling the read and the write as the settings indicate or just the read as the text indicates? -Original Message- A patch is under testing and will be included in 1.8.1.1. Until 1.8.1.1

Re: [Lustre-discuss] WARNING: data corruption issue found in 1.8.x releases

2009-09-09 Thread Mervini, Joseph A
.@lists.lustre.org [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of Charles A. Taylor Sent: Wednesday, September 09, 2009 12:07 PM To: Johann Lombardi Cc: lustre-discuss@lists.lustre.org discuss Subject: Re: [Lustre-discuss] WARNING: data corruption issue found in 1.8.x releases Just for the

Re: [Lustre-discuss] WARNING: data corruption issue found in 1.8.x releases

2009-09-09 Thread Charles A. Taylor
Just for the record, we've been running 1.8.1 for several weeks now with no problems. Well, truthfully, "no problems" is an exaggeration, but it is mostly working. We see lots of log messages we are not used to regarding client and server csum differences. Anyway, your email concerned us so

[Lustre-discuss] WARNING: data corruption issue found in 1.8.x releases

2009-09-09 Thread Johann Lombardi
A bug has been identified in the 1.8 releases (1.8.0, 1.8.0.1 & 1.8.1 are impacted) that can cause data corruption on the OSTs. This problem is related to the OSS read cache feature that was introduced in 1.8.0. This can happen when a bulk read or write request is aborted due to the client b